CN110728134A - New word discovery method and device

Info

Publication number: CN110728134A
Application number: CN201810704548.2A
Authority: CN
Inventors: 谢群群; 邵荣防; 郝晖; 李萧萧; 张小卫
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2020-01-24

Abstract

The present disclosure provides a new word discovery method, including: acquiring search data of expert users in at least one field, wherein the search data includes m search terms (m ≥ 1), and the expert users are users whose behaviors and/or levels in the at least one field meet preset conditions; and selecting some of the search terms as new word candidates according to the number of searching users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

Description

New word discovery method and device

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a new word discovery method and apparatus.

Background

With the progress of technology and the growth of data volume, new words emerge continuously, and new word recognition has become an important problem in natural language processing. New word discovery methods can be applied in many fields and are particularly important in e-commerce, which must process large amounts of data. Existing new word discovery schemes, however, are mainly based on dictionary data and depend on manual labeling. They cover only a small amount of data and are therefore poorly suited to e-commerce scenarios, especially those with high timeliness requirements.

Disclosure of Invention

In view of the above, the present disclosure provides a new word discovery method, including: acquiring search data of expert users in at least one field, wherein the search data includes m search terms (m ≥ 1), and the expert users are users whose behaviors and/or levels in the at least one field meet preset conditions; and selecting some of the search terms as new word candidates according to the number of searching users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

According to an embodiment of the present disclosure, the expert users in the at least one field include expert users in a first field, and the expert users in the first field include: users whose membership level in the first field is higher than a preset level threshold; and/or users whose number of purchased commodities in the first field is higher than a preset quantity threshold; and/or users whose number of purchases in the first field is higher than a preset purchase count threshold.

According to the embodiment of the disclosure, selecting some of the search terms as new word candidates according to the number of searching users of each of the m search terms in the field to which it belongs includes: acquiring the number of searching users of a search term in its field within a period of time, wherein the number of searching users is the number of expert users in that field who searched the search term within the period of time, and, if the number of searching users of the search term exceeds a preset searching user number threshold, taking the search term as a new word candidate; and/or acquiring the number of all expert users in the field to which the search term belongs and the number of searching users of the search term in that field within a period of time, and, if the ratio of the number of searching users to the number of all expert users in the field exceeds a preset ratio threshold, taking the search term as a new word candidate.

According to an embodiment of the present disclosure, the new word discovery method further includes: obtaining search data rankings of a plurality of users within a preset range and extracting n search terms (n ≥ 1), wherein the search data rankings include a search user number ranking and/or a search frequency ranking, the search user number ranking ranks the search terms by the number of searching users, and the search frequency ranking ranks the search terms by the number of times they were searched; and obtaining the ranking change of each of the n search terms in the search user number ranking and/or the search frequency ranking within a period of time, and selecting some of the search terms as new word candidates according to the ranking change.

According to an embodiment of the present disclosure, the plurality of users within the preset range include a plurality of expert users and a plurality of general users; or the plurality of users within the preset range include a plurality of expert users; or the plurality of users within the preset range include a plurality of general users.

According to the embodiment of the disclosure, obtaining the ranking change of each of the n search terms in the search user number ranking and/or the search frequency ranking within a period of time, and selecting some of the search terms as new word candidates according to the ranking change, includes: obtaining the ranking of each search term before and after the period of time; obtaining a surge coefficient of each search term according to its rankings before and after the period of time; and taking the search terms whose surge coefficient exceeds a preset coefficient threshold as new word candidates.

According to an embodiment of the present disclosure, the surge coefficient is:

[Formula image not reproduced in the source text.]

where R_i is the ranking of the search term before the period of time, and R_j is the ranking of the search term after the period of time.

According to an embodiment of the present disclosure, the new word discovery method further includes: obtaining a plurality of commodity names and extracting new word candidates from the commodity names.

According to an embodiment of the present disclosure, extracting new word candidates from the plurality of commodity names includes: cleaning the plurality of commodity names; performing word segmentation on the cleaned commodity names to obtain a plurality of segmented words; acquiring a head set and a tail set of each segmented word; obtaining a cohesion score of each segmented word according to the head set and the tail set; and taking the segmented words whose cohesion score exceeds a preset cohesion threshold as new word candidates.

According to an embodiment of the present disclosure, obtaining the cohesion score of each segmented word according to the head set and the tail set includes: acquiring a head set cohesion score of the segmented word; acquiring a tail set cohesion score of the segmented word; and obtaining a brand score of the segmented word according to whether the segmented word contains a brand word and/or the position of the brand word in the segmented word. The cohesion score of the segmented word comprises the head set cohesion score, the tail set cohesion score, and the brand score.

According to an embodiment of the present disclosure, extracting new word candidates from the plurality of commodity names further includes: judging whether a segmented word contains a preset service word and, if so, removing the segmented word.

The embodiment of the present disclosure provides a new word discovery apparatus, including: a first acquisition module, configured to acquire search data of expert users in at least one field, wherein the search data includes m search terms (m ≥ 1), and the expert users are users whose behaviors and/or levels in a certain field meet preset conditions; and a first selection module, configured to select some of the search terms as new word candidates according to the number of searching users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

According to an embodiment of the present disclosure, the first selection module includes: a user number module, configured to acquire the number of searching users of a search term in its field within a period of time, wherein the number of searching users is the number of expert users in that field who searched the search term within the period of time, and, if the number of searching users of the search term exceeds a preset searching user number threshold, to take the search term as a new word candidate; and/or a ratio module, configured to acquire the number of all expert users in the field to which the search term belongs and the number of searching users of the search term in that field within a period of time, and, if the ratio of the number of searching users to the number of all expert users in the field exceeds a preset ratio threshold, to take the search term as a new word candidate.

According to an embodiment of the present disclosure, the new word discovery apparatus further includes: a second acquisition module, configured to obtain search data rankings of a plurality of users within a preset range and extract n search terms (n ≥ 1), wherein the search data rankings include a search user number ranking and/or a search frequency ranking, the search user number ranking ranks the search terms by the number of searching users, and the search frequency ranking ranks the search terms by the number of times they were searched; and a second selection module, configured to obtain the ranking change of each of the n search terms in the search user number ranking and/or the search frequency ranking within a period of time, and to select some of the search terms as new word candidates according to the ranking change.

According to an embodiment of the present disclosure, the new word discovery apparatus further includes: a third acquisition module, configured to acquire a plurality of commodity names; and a third selection module, configured to extract new word candidates from the plurality of commodity names.

The embodiment of the present disclosure further provides another new word discovery apparatus, including: one or more processors; a storage device to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the new word discovery method as described above.

Embodiments of the present disclosure also provide a computer-readable medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned new word discovery method.

According to the embodiment of the disclosure, the problems that existing new word calculation methods are based on dictionary data, depend on manual labeling, cover a small data volume, and are difficult to adapt to e-commerce application scenarios can be at least partially solved. The method can therefore meet requirements on data volume, timeliness, and accuracy, and is suitable for e-commerce application scenarios.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

Fig. 1 schematically illustrates an exemplary system architecture of a new word discovery method and apparatus according to an embodiment of the present disclosure;

Fig. 2 schematically illustrates a flow chart of a new word discovery method according to an embodiment of the present disclosure;

Fig. 3 schematically illustrates another flow chart for obtaining new word candidates according to an embodiment of the present disclosure;

Fig. 4 schematically illustrates a flow chart for extracting new word candidates according to ranking changes according to an embodiment of the present disclosure;

Fig. 5 schematically illustrates a flow chart for extracting new word candidates from commodity names according to an embodiment of the present disclosure;

Fig. 6 schematically illustrates a block diagram of a new word discovery apparatus according to an embodiment of the present disclosure;

Fig. 7 schematically illustrates a block diagram of a first selection module according to an embodiment of the present disclosure;

Fig. 8 schematically illustrates a block diagram of a new word discovery apparatus according to another embodiment of the present disclosure;

Fig. 9 schematically illustrates a block diagram of a new word discovery apparatus according to yet another embodiment of the present disclosure;

Fig. 10 schematically illustrates a block diagram of a third selection module according to an embodiment of the present disclosure;

Fig. 11 schematically illustrates a relationship diagram between modules according to an embodiment of the present disclosure;

Fig. 12 schematically illustrates a block diagram of a new word discovery apparatus according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, operations, and/or components, but do not preclude the presence or addition of one or more other features, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a device having at least one of A, B, and C" would include but not be limited to devices having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a device having at least one of A, B, or C" would include but not be limited to devices having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase "A or B" should be understood to include the possibilities of "A" or "B", or "A and B".

An embodiment of the present disclosure provides a new word discovery method, including: acquiring search data of expert users in at least one field, wherein the search data includes m search terms (m ≥ 1), and the expert users are users whose behaviors and/or levels in the at least one field meet preset conditions; and selecting some of the search terms as new word candidates according to the number of searching users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

According to the embodiment of the disclosure, users' search data can be processed to screen out new word candidates. Because search data reflects popularity trends, collecting and sorting the real-time search data of a large number of users makes it possible to obtain comprehensive, up-to-date new word information quickly. The search data of particular user groups, for example expert users, can also be singled out to obtain new word information in a particular field for that group.

Fig. 1 schematically illustrates an exemplary system architecture 100 of a new word discovery method and apparatus according to an embodiment of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

As shown in Fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services. It may analyze and otherwise process received data such as user requests, and feed back a processing result (for example, information or data obtained or generated according to the user request) to the terminal device.

The server 105 may acquire as much search data as possible from a plurality of terminal devices and obtain new words from the search data. Alternatively, the terminal devices may, after acquiring search data for as many commodities as possible, obtain new words from the search data and transmit the new words to the server for storage.

The server 105 may also be a cloud server and/or a distributed cluster of servers. The server 105 can also collect, collate, process, and analyze the various data generated when users operate the terminal devices 101, 102, 103.

It should be noted that the new word discovery method provided by the embodiments of the present disclosure may generally be executed by the server 105. Accordingly, the new word discovery apparatus provided by the embodiments of the present disclosure may generally be disposed in the server 105. The method may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105; accordingly, the apparatus may also be disposed in such a server or server cluster. Alternatively, the method may be executed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal devices 101, 102, and 103; accordingly, the apparatus may also be disposed in the terminal device 101, 102, or 103, or in another terminal device different from them.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically shows a flow chart of a new word discovery method according to an embodiment of the present disclosure.

As shown in fig. 2, the new word discovery method of the embodiment of the present disclosure includes operations S200 to S300.

In operation S200, search data of expert users in at least one field is obtained, where the search data includes m search terms (m ≥ 1), and the expert users are users whose behaviors and/or levels in the at least one field satisfy a preset condition.

Specifically, the search data of a user may include the user's search information in one or more applications on the terminal device or on a webpage. For example, the search data may be the search data of a plurality of terminal users on the Jingdong APP and the Jingdong webpage. The server 105 may obtain all or part of the users' search data from the Jingdong APP and the Jingdong webpage on the plurality of terminals, compile statistics on the search data, and obtain a plurality of search terms from it according to a certain selection method.

An expert user may be a user whose behavior and/or level in one or more fields meets a preset condition, for example, a user whose subject-specific membership level in one or more fields is high and who has a large number of purchase records. The fields may include photography, drawing, 3C, games, and the like.

According to an embodiment of the present disclosure, the expert users in the at least one field include expert users in a first field, and the expert users in the first field include: users whose membership level in the first field is higher than a preset level threshold; and/or users whose number of purchased commodities in the first field is higher than a preset quantity threshold; and/or users whose number of purchases in the first field is higher than a preset purchase count threshold.

Specifically, the first field may be, for example, the 3C field. Jingdong may evaluate a user's subject-specific membership level in the field according to a certain criterion, with reference to the user's purchasing behavior, browsing and searching behavior, or commenting and following behavior in the 3C field, and consider the user an expert user when the membership level exceeds a preset level threshold. Alternatively, the user may be considered an expert user if the user has a large number of purchase records in the field and the number of purchases or the number of purchased commodities exceeds a certain threshold.
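The expert-user conditions described above can be sketched as a simple predicate. This is a minimal illustration, not the disclosed implementation: the `DomainProfile` record layout, the function name `is_expert`, and all threshold defaults are hypothetical, and the "and/or" combination is simplified to a plain OR.

```python
from dataclasses import dataclass


@dataclass
class DomainProfile:
    """Per-user activity in one field (hypothetical record layout)."""
    member_level: int      # membership level in the field
    items_purchased: int   # number of commodities purchased in the field
    purchase_count: int    # number of purchases made in the field


def is_expert(profile, level_threshold=3, items_threshold=20, count_threshold=10):
    """Return True if any of the three preset conditions is met.

    The thresholds are made-up illustrative values; the source only says
    they are preset.
    """
    return (
        profile.member_level > level_threshold
        or profile.items_purchased > items_threshold
        or profile.purchase_count > count_threshold
    )
```

A stricter variant could require several conditions at once; the text leaves that choice open.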

Obtaining the search data of expert users in at least one field may involve, for example, extracting the search data of the expert users in each field over one day or one month according to the fields defined by Jingdong, where the search data includes m search terms, and then calculating the number of searching users and the number of searches of each search term in each field.

In operation S300, some of the search terms are selected as new word candidates according to the number of searching users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

Specifically, the number of searching users of a search term in its field may refer to the number of expert users in that field who have searched the term. For example, suppose user A is an expert user in the 3C field and searched for "full-screen mobile phone" within the statistical period, and "full-screen mobile phone" is one of the m search terms. The number of all expert users in the 3C field who searched for "full-screen mobile phone" within the statistical period is then the number of searching users of the search term "full-screen mobile phone".

The number of searches in the corresponding field for each search term may refer to the number of times the search term is searched by an expert user in the field within a statistical time period.

After the number of searching users and/or the number of searches of each search term in its field is obtained, some of the search terms are selected as new word candidates according to a preset rule.
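The two statistics just described (searching users and search count per term, restricted to expert users of the field) can be computed in one pass over a search log. The sketch below assumes a hypothetical log of `(user_id, field, term)` tuples and a hypothetical mapping from each field to its set of expert user ids; neither layout comes from the source.

```python
from collections import defaultdict


def count_search_stats(search_log, expert_ids_by_field):
    """Aggregate per-field, per-term statistics over expert users only.

    search_log: iterable of (user_id, field, term) for the statistical period.
    expert_ids_by_field: dict mapping field -> set of expert user ids.
    Returns (users, counts), both keyed by (field, term):
      users  = number of distinct expert users who searched the term
      counts = number of times experts searched the term
    """
    searchers = defaultdict(set)
    counts = defaultdict(int)
    for user_id, field, term in search_log:
        # Searches by non-experts in this field are ignored.
        if user_id not in expert_ids_by_field.get(field, set()):
            continue
        key = (field, term)
        searchers[key].add(user_id)
        counts[key] += 1
    users = {key: len(ids) for key, ids in searchers.items()}
    return users, dict(counts)
```

Keeping distinct users in a set means repeated searches by the same expert raise the search count but not the searching user count.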

In the embodiment of the disclosure, users' search data can be processed to screen out new word candidates. Search data reflects popularity trends, and the search data of particular user groups can be singled out to obtain new word information in a particular field. For example, the search data of experts in the same industry can be extracted industry by industry, and frequently occurring terms can be taken as new word candidates, so that new words in a professional field are derived from organized expert knowledge. The new word discovery method of the embodiment of the disclosure can therefore at least partially solve the problems that existing new word calculation methods are based on dictionary data, depend on manual labeling, cover a small data volume, and are difficult to adapt to e-commerce application scenarios. It can meet requirements on data volume, timeliness, and accuracy, and is suitable for e-commerce application scenarios.

According to an embodiment of the present disclosure, operation S300 may include operation S310 and/or operation S320:

In operation S310, the number of searching users of a search term in its field within a period of time is acquired, where the number of searching users is the number of expert users in that field who searched the search term within the period of time; if the number of searching users of the search term exceeds a preset searching user number threshold, the search term is taken as a new word candidate.

In operation S320, the number of all expert users in the field to which the search term belongs and the number of searching users of the search term in that field within a period of time are acquired; if the ratio of the number of searching users to the number of all expert users in the field exceeds a preset ratio threshold, the search term is taken as a new word candidate.

Specifically, new word candidates may be screened out by performing operation S310 and/or operation S320 on each of the m search terms. Take the search term "full-screen mobile phone" as an example. It belongs to the 3C field, so the number of all expert users in the 3C field is obtained, together with the number of searching users of "full-screen mobile phone" within the statistical period. If the number of searching users of "full-screen mobile phone" exceeds the preset searching user number threshold, and/or the ratio of the number of searching users to the number of all expert users in the 3C field exceeds the preset ratio threshold, "full-screen mobile phone" is taken as a new word candidate. Based on experience, the searching user number threshold can be set to 100 and the ratio threshold to 0.01, which can be expressed as:

$$W_i > 100 \quad \text{and/or} \quad \frac{W_i}{Search_{total}} > 0.01$$

where W_i is the number of expert users in the field who searched for term i, and Search_total is the number of all expert users in the corresponding field.
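A minimal sketch of the screening in operations S310 and/or S320, using the empirical thresholds of 100 searching users and a ratio of 0.01 mentioned above. The function name, the dictionary input, and the `require_both` switch (which picks between the "and" and "or" readings of "and/or") are assumptions made for illustration.

```python
def select_candidates(search_users, total_experts,
                      user_threshold=100, ratio_threshold=0.01,
                      require_both=False):
    """Select new word candidates within a single field.

    search_users: dict mapping term -> number of expert users who searched it.
    total_experts: number of all expert users in the field (must be > 0).
    """
    if total_experts <= 0:
        raise ValueError("total_experts must be positive")
    candidates = []
    for term, w_i in search_users.items():
        passes_count = w_i > user_threshold                     # S310
        passes_ratio = w_i / total_experts > ratio_threshold    # S320
        if require_both:
            selected = passes_count and passes_ratio
        else:
            selected = passes_count or passes_ratio
        if selected:
            candidates.append(term)
    return candidates
```

In a small field the ratio test can select terms that the absolute count test misses, and in a large field the reverse holds, which is one reason to keep both tests available.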

Fig. 3 schematically shows another flowchart for obtaining a new word candidate according to an embodiment of the present disclosure.

As shown in fig. 3, the new word discovery method may further include operations S400 to S500 according to an embodiment of the present disclosure.

In operation S400, search data ranks of a plurality of users within a preset range are obtained, and n search terms (n is greater than or equal to 1) are extracted, where the search data ranks include a search user number rank and/or a search frequency rank, the search user number rank ranks the search terms according to the search user number, and the search frequency rank ranks the search terms according to the searched times.

In operation S500, the ranking change of each of the n search terms in the search user number ranking and/or the search frequency ranking within a period of time is obtained, and some of the search terms are selected as new word candidates according to the ranking change.

According to an embodiment of the present disclosure, the plurality of users within the preset range include a plurality of expert users and a plurality of general users; or the plurality of users within the preset range include a plurality of expert users; or the plurality of users within the preset range include a plurality of general users.

Specifically, the plurality of users within the preset range may include both expert users and general users, for example all registered users. Alternatively, they may include only expert users, for example all expert users, or only general users, for example all general users. Different user ranges yield different search statistics, and the range can be chosen according to actual conditions.

In the embodiment of the present disclosure, take the search data of all Jingdong users within a period of time as an example. All Jingdong users may include all users who have registered a Jingdong account as well as users who have not registered an account but have used the search function. The period of time may be, for example, one hour. The search data may be search terms, and a search term may include characters such as Chinese characters, letters, numbers, or symbols.

And counting the search data of all users in the period of time and ranking the search data of the users, wherein the ranking can be ranking the search words according to the searched times of the search words in the period of time to obtain the rank of the search times, and the ranking can also be ranking the search words according to the number of the users who searched the search words in the period of time to obtain the rank of the number of the search users. For example, the number of searches in one hour obtained by the above manner is ranked as follows: the mobile phone comprises a tablet computer, a mobile phone, an air purifier, a curved screen television and a full-screen mobile phone. After the ranks of the search terms are obtained, n search terms located at the top n positions may be selected, and for example, the top 10 ten thousand search terms may be taken out from the ranks for subsequent operations.
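The two rankings described here can be built from a raw log in a few lines. The sketch assumes a hypothetical log of `(user_id, term)` pairs for the period, and breaks ties alphabetically, which the source does not specify.

```python
from collections import Counter


def rank_terms(search_log, top_n=100_000):
    """Build the search frequency ranking and the search user number ranking.

    search_log: iterable of (user_id, term) pairs for the period.
    Returns (by_count, by_users): two lists of terms, best-ranked first,
    each truncated to the top_n terms.
    """
    counts = Counter()
    users = {}
    for user_id, term in search_log:
        counts[term] += 1
        users.setdefault(term, set()).add(user_id)
    # Sort descending by the statistic; ties are broken alphabetically (assumption).
    by_count = sorted(counts, key=lambda t: (-counts[t], t))[:top_n]
    by_users = sorted(users, key=lambda t: (-len(users[t]), t))[:top_n]
    return by_count, by_users
```

The two rankings can disagree: a term searched many times by one user ranks high by frequency but low by searching users.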

The ranking change of each of the top 100,000 search words over a period of time is then calculated, and some of the search words are selected as new word candidates according to the ranking change of each search word.

Before the subsequent operations, the search words may be cleaned, for example by removing special characters, converting letters to lower case, and converting traditional characters to simplified characters, so that the search words are normalized and easier to process.

Fig. 4 schematically shows a flowchart of extracting a new word candidate word according to a ranking change condition according to an embodiment of the present disclosure.

As shown in fig. 4, operation S500 may include operations S510 to S530:

in operation S510, a ranking of each search term before and after a period of time is obtained;

in operation S520, a surge coefficient of each search term is obtained according to the rankings of the search term before and after the period of time;

in operation S530, search terms whose surge coefficient exceeds a preset coefficient threshold are used as new word candidates.

Specifically, the period of time may be one hour. The rank of each of the 100,000 search terms before and after the hour is obtained; the rank one hour earlier is denoted R_i and the rank one hour later is denoted R_j.

A surge coefficient of each search term is obtained according to its rankings before and after the period of time. The surge coefficient characterizes how quickly the search volume of the search term rises within a short time. The surge coefficient Score_hot can be calculated using the following formula:

where R_i is the ranking of the search term before the period of time and R_j is the ranking of the search term after the period of time.

Search terms whose surge coefficient exceeds a preset coefficient threshold are used as new word candidates; the preset coefficient threshold may be set empirically to 0.45.
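Operations S510 to S530 can be sketched as below. Because the formula for Score_hot is not reproduced in this text, the sketch takes the coefficient as a caller-supplied function. The function `relative_rank_gain` used in the demonstration is a hypothetical stand-in chosen only so the example runs; it is not the formula of the disclosure. The names and sample ranks are likewise assumptions.

```python
def select_surging_terms(rank_before, rank_after, coefficient, threshold=0.45):
    """Return search terms whose surge coefficient exceeds the threshold.

    rank_before, rank_after: dicts mapping search term -> 1-based rank
    (R_i and R_j in the text).
    coefficient: callable (r_i, r_j) -> float that computes Score_hot.
    """
    candidates = []
    for term, r_j in rank_after.items():
        r_i = rank_before.get(term)
        if r_i is None:
            # Term was unranked before the period; the text does not say
            # how to treat this case, so it is skipped here.
            continue
        if coefficient(r_i, r_j) > threshold:
            candidates.append(term)
    return candidates

def relative_rank_gain(r_i, r_j):
    """Hypothetical stand-in for Score_hot, NOT the patent's formula."""
    return (r_i - r_j) / r_i

before = {"full-screen mobile phone": 1000, "tablet computer": 3, "air purifier": 50}
after = {"full-screen mobile phone": 100, "tablet computer": 2, "air purifier": 60}
surging = select_surging_terms(before, after, relative_rank_gain)
print(surging)
```

Keeping the coefficient pluggable means the threshold logic of operation S530 stays the same whichever formula is actually used.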

In the embodiment of the disclosure, ranking the real-time search data of a large number of users allows comprehensive, up-to-date new word information to be acquired quickly. Words whose search volume rises rapidly within a short time are computed from the search data of a large number of users and used as new word candidates. Such surging words reflect what is popular in the short term, and because the volume of search data is large, new word information can be acquired quickly and comprehensively.

According to an embodiment of the present disclosure, the new word discovery method may further include operation S600:

in operation S600, a plurality of commodity names are obtained, and new word candidates are extracted from the plurality of commodity names.

Specifically, in addition to the new word candidates obtained from search data, new word candidates may be obtained by analyzing the names of a large number of commodities. The commodities may be all or some of the commodities on sale in the Jingdong mall; their names are obtained and processed, and new word candidates are extracted according to a preset rule.

Fig. 5 schematically shows a flowchart of extracting a new word candidate word in a commodity name according to an embodiment of the present disclosure.

As shown in fig. 5, operation S600 includes operations S610 to S650:

in operation S610, the plurality of commodity names are cleaned.

Specifically, cleaning the plurality of commodity names includes the following operations:

(1) replacing special characters in the commodity name, e.g., [ ￥ % … ( ) ] " : / > <, with spaces;

(2) removing pure numeric strings in the commodity name;

(3) collapsing multiple consecutive spaces in the commodity name into one space;

(4) removing invisible characters in the commodity name;

(5) converting the characters in the commodity name to lower case;

(6) converting traditional characters in the commodity name to simplified characters.
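The six cleaning operations can be sketched as a single function. This is an illustrative reading only: the set of special characters is the subset listed in step (1), "pure numeric string" is assumed to mean a whitespace-delimited token made only of digits, "invisible characters" is assumed to mean Unicode control and format characters, and the traditional-to-simplified conversion is left to a caller-supplied function because the standard library has none. The names `clean_name` and `to_simplified` are hypothetical.

```python
import re
import unicodedata

# Subset of special characters named in step (1); extend as needed.
SPECIAL_CHARS = re.compile(r'[\[\]￥%…()":/><]')

def clean_name(name, to_simplified=None):
    """Apply cleaning steps (1) to (6) in the listed order."""
    name = SPECIAL_CHARS.sub(" ", name)                # (1) special chars -> space
    name = re.sub(r"(?<!\S)\d+(?!\S)", " ", name)      # (2) drop digit-only tokens
    name = re.sub(r" {2,}", " ", name)                 # (3) collapse repeated spaces
    name = "".join(                                    # (4) drop control/format chars
        ch for ch in name if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    name = name.lower()                                # (5) lower case
    if to_simplified is not None:                      # (6) optional converter
        name = to_simplified(name)
    return name

cleaned = clean_name("Huawei [P20]  128 G\u200bB 手机")
print(cleaned)
```

Passing the converter as a parameter keeps the sketch self-contained; in practice a dedicated conversion library or mapping table would be supplied there.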

In operation S620, the cleaned commodity names are segmented to obtain a plurality of segmented words.

Specifically, sliding-window segmentation is performed on the cleaned commodity name. Taking the name "全面屏手机京东自营" (full-screen mobile phone, Jingdong self-operated) as an example, sliding-window segmentation with windows of 2 to 5 characters is performed. First, 2-character combinations are extracted in order from left to right: 全面, 面屏, 屏手, 手机, 机京, and so on. Next, 3-character combinations are extracted in order from left to right: 全面屏, 面屏手, 屏手机, 手机京, and so on. The 4-character and 5-character combinations are extracted in the same way.
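The sliding-window segmentation can be sketched as follows, using the Chinese name "全面屏手机京东自营" (full-screen mobile phone, Jingdong self-operated), which is assumed to be the original of the translated example. The function name is hypothetical, and whether windows may span spaces inside a cleaned name is not specified in the text; this sketch simply slides over every character.

```python
def sliding_window_segments(name, min_len=2, max_len=5):
    """Return all substrings of min_len..max_len characters, shortest first,
    each length scanned from left to right."""
    segments = []
    for size in range(min_len, max_len + 1):
        for start in range(len(name) - size + 1):
            segments.append(name[start:start + size])
    return segments

segments = sliding_window_segments("全面屏手机京东自营")
print(segments[:8])   # the 2-character windows
print(len(segments))  # total number of windows of length 2 to 5
```

For a name of L characters there are L - k + 1 windows of length k, so the 9-character example yields 8 + 7 + 6 + 5 = 26 segmented words.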

In operation S630, a head set and a tail set of each segmented word are obtained.

Specifically, for each extracted segmented word, its head and tail single-character data sets are recorded, without de-duplication. The head set of a word is the set of characters immediately to the left of the word in the commodity names, and the tail set of a word is the set of characters immediately to the right of the word in the commodity names. Take the segmented word "手机" (mobile phone) as an example. Among the commodities on sale at Jingdong there is a commodity named "全面屏手机京东自营" (full-screen mobile phone, Jingdong self-operated); in this name the head character of "手机" is "屏" (screen) and the tail character is "京". Searching all commodities on sale in the same way yields the head set D_head and the tail set D_tail of "手机". Note that if the segmented word is located at the beginning of a commodity name, the head character is a space; for example, "mobile phone" is at the beginning of the commodity name "mobile phone 128G red envelope mobile phone", so the head character is a space. Likewise, if the segmented word is located at the end of a commodity name, the tail character is a space; for example, "mobile phone" is at the end of the commodity name "red 128G envelope mobile phone", so the tail character is a space. For example, D_head = {line, new, screen, hand, color, space, G, hand, ...} and D_tail = {shell, membrane, sleeve, space, membrane, screen, ..., red, green}.
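Collecting the head and tail sets can be sketched as below. The sets are kept as lists because the text says they are not de-duplicated, and a space stands in for a missing neighbor at the start or end of a name. The function name and the three sample commodity names (全面屏手机京东自营, 手机壳 "phone case", 新手机 "new phone") are hypothetical illustrations.

```python
from collections import defaultdict

def collect_head_tail(names, min_len=2, max_len=5):
    """Map every sliding-window segment to the characters seen immediately
    to its left (head) and right (tail) across all names, duplicates kept."""
    heads = defaultdict(list)
    tails = defaultdict(list)
    for name in names:
        for size in range(min_len, max_len + 1):
            for start in range(len(name) - size + 1):
                end = start + size
                word = name[start:end]
                # A space marks "no neighbor" at either boundary of the name.
                heads[word].append(name[start - 1] if start > 0 else " ")
                tails[word].append(name[end] if end < len(name) else " ")
    return heads, tails

names = ["全面屏手机京东自营", "手机壳", "新手机"]
heads, tails = collect_head_tail(names)
print(heads["手机"])
print(tails["手机"])
```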

Between operations S630 and S640, operation S660 may further be included: judging whether a segmented word contains a preset service word, and if so, removing the segmented word.

Specifically, the preset service words may be e-commerce service words, for example promotional terms such as "free shipping", "express delivery", "slow-decrease", "sales promotion", and "promotion"; segmented words containing a preset service word are removed. Note that segmented words containing business-scenario terms such as "Jingdong self-operated", "211 on-time delivery", "Jingdong Supermarket", and "Jingdong Fresh" do not need to be removed.

Operation S660 may also be performed at another position among operations S620 to S650.
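The service-word filter of operation S660 amounts to a substring check. A minimal sketch, assuming a hypothetical service-word list of 包邮 (free shipping) and 促销 (sales promotion); the function name is likewise an assumption:

```python
def remove_service_word_segments(segments, service_words):
    """Drop every segmented word that contains any preset service word."""
    return [s for s in segments if not any(w in s for w in service_words)]

# Hypothetical preset service words.
SERVICE_WORDS = {"包邮", "促销"}

kept = remove_service_word_segments(
    ["手机", "手机包邮", "促销手机", "京东自营"], SERVICE_WORDS
)
print(kept)
```

Business-scenario terms such as 京东自营 (Jingdong self-operated) survive the filter simply because they are not placed in the service-word list.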

In operation S640, a cohesion score of each segmented word is obtained according to the head set and the tail set.

According to an embodiment of the present disclosure, operation S640 includes operations S641 to S643:

in operation S641, a head set cohesion score of the segmented word is obtained.

First, the spaces in the head set D_head of the segmented word are extracted and a space cohesion score is calculated:

where the number of occurrences of the space is the number of spaces in the head set D_head, and the number of all characters is the number of all characters in the head set D_head.

Second, the information entropy score of the head set D_head is calculated using the following formula:

where P_i is the probability of the i-th character in the set (excluding spaces), and n is the number of all characters other than spaces in the head set; P_i is calculated as follows:

where the number of occurrences of character i is the number of times character i appears in the head set D_head, and the total number of all characters is the number of all characters other than spaces in the head set D_head.

Therefore, the head set cohesion score of the segmented word is:

Score_head＝Score_{blank_head}+Score_{E_head}
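One plausible reading of the head set score can be sketched as follows. The formulas themselves are not reproduced in this text, so two assumptions are made and should be treated as such: the space score is taken to be the plain ratio of spaces to all characters in the set, and the entropy score is taken to be the standard Shannon entropy (natural logarithm) over the non-space characters. Any scaling factor or different logarithm base in the original formulas would change the values. All function names are hypothetical.

```python
import math
from collections import Counter

def blank_score(chars):
    """Assumed Score_blank: spaces divided by all characters in the set."""
    return chars.count(" ") / len(chars) if chars else 0.0

def entropy_score(chars):
    """Assumed Score_E: Shannon entropy (natural log) of non-space characters,
    with P_i = occurrences of character i / number of non-space characters."""
    non_blank = [c for c in chars if c != " "]
    total = len(non_blank)
    if total == 0:
        return 0.0
    counts = Counter(non_blank)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def set_cohesion(chars):
    """Score for one set (head or tail): space score plus entropy score."""
    return blank_score(chars) + entropy_score(chars)

head = ["屏", " ", "新", "屏"]  # hypothetical head set, duplicates kept
print(blank_score(head), entropy_score(head), set_cohesion(head))
```

Under this reading, a word that often starts a name (many spaces in the head set) or is preceded by many different characters (high entropy) scores higher, which matches the intuition that it forms an independent unit. The same function applies unchanged to the tail set.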

in operation S642, a tail set cohesion score of the segmented word is obtained.

Similarly to operation S641, the tail set cohesion score Score_tail of the segmented word can be calculated.

In operation S643, a brand score of the segmented word is obtained according to whether the segmented word includes a brand word and/or the position of the brand word in the segmented word.

Further, if the segmented word begins or ends with a brand word, a brand score is added for the segmented word. The brand word may be, for example, a commodity brand such as "Huawei". For example, the segmented word "Huawei mobile phone" begins with the brand word "Huawei", so it receives a brand score. The brand score may be set empirically to a value from 3 to 7, and the scoring may be performed, for example, by the following formula:

The final cohesion score of the segmented word includes the head set cohesion score, the tail set cohesion score, and the brand score:

Score_total＝Score_head+Score_tail+Score_brand

in operation S650, segmented words whose cohesion score exceeds a preset cohesion threshold are used as new word candidates.

Specifically, the cohesion threshold may be set empirically to a value from 6 to 10; for example, when the brand score is set to 5, segmented words whose cohesion score exceeds 6 are used as new word candidates.
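The brand score and the threshold test of operation S650 can be sketched as follows. The brand scoring formula is not reproduced in this text, so the sketch assumes the simplest reading: a fixed score when the segmented word begins or ends with a brand word, and zero otherwise. The head and tail scores are supplied as inputs; the sample values, the brand list (华为, Huawei), and the function names are hypothetical.

```python
def brand_score(word, brand_words, score=5.0):
    """Assumed Score_brand: fixed score if the word starts or ends with a brand word."""
    if any(word.startswith(b) or word.endswith(b) for b in brand_words):
        return score
    return 0.0

def select_candidates(scored, brand_words, threshold=6.0, brand=5.0):
    """scored: dict word -> (Score_head, Score_tail).
    Keep words with Score_total = Score_head + Score_tail + Score_brand > threshold."""
    candidates = []
    for word, (score_head, score_tail) in scored.items():
        total = score_head + score_tail + brand_score(word, brand_words, brand)
        if total > threshold:
            candidates.append(word)
    return candidates

# Hypothetical head/tail scores for three segmented words.
scored = {"华为手机": (0.9, 0.8), "屏手": (0.2, 0.1), "手机": (3.5, 3.0)}
candidates = select_candidates(scored, brand_words={"华为"})
print(candidates)
```

With a brand score of 5 and a threshold of 6, a brand-led word needs only modest head and tail scores to qualify, while a word without a brand must clear the threshold on cohesion alone.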

In the embodiment of the disclosure, candidates are extracted from commodity names, and candidate word data is generated by calculating the cohesion of each candidate with respect to the characters before and after it. Words specific to the e-commerce field (such as brand words) and e-commerce service words (such as free shipping and sales promotion) receive special handling so that the method suits the e-commerce field.

After the new word candidates are obtained by the above method, manual evaluation may be used to confirm whether each candidate is a final new word, and a new word list is generated for each service.

The embodiment of the disclosure also provides a new word discovering device.

Fig. 6 schematically shows a block diagram of a new word discovery apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the new word discovery apparatus 700 may include a first obtaining module 710 and a first selecting module 720, wherein:

the first obtaining module 710 is configured to obtain search data of expert users in at least one field, where the search data includes m search terms (m ≧ 1), and the expert users are users whose behaviors and/or levels in a certain field meet preset conditions.

Specifically, the first obtaining module 710 may perform the operation S200 described above, for example, and is not described herein again.

The first selecting module 720 is configured to select some of the search terms as new word candidates according to the number of search users and/or the number of searches of each of the m search terms in the field to which the search term belongs.

Specifically, the first selecting module 720 may perform the operation S300 described above, for example, and is not described herein again.

FIG. 7 schematically shows a block diagram of a first selection module according to an embodiment of the disclosure.

As shown in fig. 7, the first selection module 720 may include a user number module 721 and a ratio module 722 according to an embodiment of the present disclosure.

The user number module 721 is configured to obtain the number of search users of a search word in the field to which it belongs within a period of time, where the number of search users is the number of expert users in the field who have searched the search word within the period of time; and if the number of search users of the search word exceeds a preset search user number threshold, to take the search word as a new word candidate.

Specifically, the user number module 721 may perform the operation S310 described above, for example, and is not described herein again.

The ratio module 722 is used for acquiring the number of all expert users in the field to which the search term belongs and the number of search users in the field to which the search term belongs within a period of time; and if the ratio of the number of the search users to the number of all expert users in the field exceeds a preset ratio threshold, taking the search word as a new word candidate word.

Specifically, the ratio module 722 may perform the operation S320 described above, for example, and is not described herein again.
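The two criteria applied by the user number module 721 and the ratio module 722 can be sketched together. This is a minimal illustration, assuming the search log of one field and one time window is available as (user, search word) pairs; the function name, threshold values, and sample data are hypothetical.

```python
from collections import defaultdict

def select_by_expert_users(search_log, expert_users, min_users=None, min_ratio=None):
    """Select search words by expert search-user count and/or ratio.

    search_log: iterable of (user_id, search_word) pairs for one field and period.
    expert_users: set of all expert user ids in that field.
    A word qualifies if its expert user count exceeds min_users, or if the
    count divided by the number of all experts exceeds min_ratio.
    """
    users_per_term = defaultdict(set)
    for user_id, term in search_log:
        if user_id in expert_users:  # only expert users are counted
            users_per_term[term].add(user_id)
    candidates = []
    for term, users in users_per_term.items():
        count = len(users)
        by_count = min_users is not None and count > min_users
        by_ratio = (
            min_ratio is not None
            and len(expert_users) > 0
            and count / len(expert_users) > min_ratio
        )
        if by_count or by_ratio:
            candidates.append(term)
    return candidates

experts = {"e1", "e2", "e3", "e4"}
search_log = [
    ("e1", "curved screen television"), ("e2", "curved screen television"),
    ("e3", "curved screen television"), ("e1", "air purifier"),
    ("x9", "air purifier"),  # not an expert, ignored
]
print(select_by_expert_users(search_log, experts, min_users=2))
print(select_by_expert_users(search_log, experts, min_ratio=0.2))
```

The ratio criterion normalizes by the size of the expert population, so it remains comparable across fields with very different numbers of expert users.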

Fig. 8 schematically shows a block diagram of a new word discovery apparatus according to another embodiment of the present disclosure.

As shown in fig. 8, according to an embodiment of the present disclosure, the new word discovery apparatus 700 may further include a second obtaining module 730 and a second selecting module 740, where:

the second obtaining module 730 is configured to obtain search data ranks of a plurality of users within a preset range, and extract n search terms (n is greater than or equal to 1) from the search data ranks, where the search data ranks include a search user number rank and/or a search frequency rank, the search user number rank ranks the search terms according to the search user number, and the search frequency rank ranks the search terms according to the searched frequency.

Specifically, the second obtaining module 730 may perform the operation S400 described above, for example, and is not described herein again.

The second selecting module 740 is configured to obtain a ranking change condition of each search term in the n search terms in the search user ranking and/or the search frequency ranking within a period of time, and select a part of the search terms as new term candidate terms according to the ranking change condition.

Specifically, the second selecting module 740 may perform the operation S500 described above, for example, and is not described herein again.

Fig. 9 schematically shows a block diagram of a new word discovery apparatus according to yet another embodiment of the present disclosure.

As shown in fig. 9, according to an embodiment of the present disclosure, the new word discovery apparatus 700 may further include a third obtaining module 750 and a third selecting module 760, where:

the third obtaining module 750 is configured to obtain a plurality of names of commodities;

the third selection module 760 is configured to extract a candidate word of a new word from the plurality of names of the goods.

Specifically, the third obtaining module 750 and the third selecting module 760 may, for example, perform the operation S600 described above, which is not described herein again.

FIG. 10 schematically shows a block diagram of a third selection module according to an embodiment of the disclosure.

As shown in fig. 10, the third selection module 760 may include a cleaning module 761, a word segmentation module 762, and a cohesion module 763, wherein:

the cleaning module 761 is used for cleaning a plurality of names of commodities. Specifically, the cleaning module 761 can perform the operation S610 described above, for example, and will not be described herein again.

The word segmentation module 762 is configured to segment words of the washed commodity name to obtain a plurality of segmented words. Specifically, the word segmentation module 762 may perform the operation S620 described above, for example, and is not described herein again.

The cohesion degree module 763 is configured to obtain a head set and a tail set of each word segmentation word, obtain a cohesion degree score of each word segmentation word according to the head set and the tail set, and use a word segmentation word with the cohesion degree score exceeding a preset cohesion degree threshold as a new word candidate word. Specifically, the cohesion degree module 763 may perform the operations S630 to S650 described above, for example, and will not be described herein again.

FIG. 11 schematically shows a relationship diagram between modules according to an embodiment of the disclosure.

As shown in fig. 11, the new word discovery apparatus may further include a new word confirming module 770 for receiving the new word candidates output by the first selecting module 720, the second selecting module 740, and the third selecting module 760, and selecting real new words from the new word candidates according to a preset rule.

It is understood that the first obtaining module 710, the first selecting module 720, the number of users module 721, the ratio module 722, the second obtaining module 730, the second selecting module 740, the third obtaining module 750, the third selecting module 760, the cleaning module 761, the word segmentation module 762 and the cohesion degree module 763 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present invention, at least one of the first obtaining module 710, the first selecting module 720, the user number module 721, the ratio module 722, the second obtaining module 730, the second selecting module 740, the third obtaining module 750, the third selecting module 760, the cleaning module 761, the word segmentation module 762 and the cohesion degree module 763 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented in a suitable combination of three implementations of software, hardware and firmware. 
Alternatively, at least one of the first obtaining module 710, the first selecting module 720, the number of users module 721, the ratio module 722, the second obtaining module 730, the second selecting module 740, the third obtaining module 750, the third selecting module 760, the cleaning module 761, the word segmentation module 762 and the cohesion degree module 763 may be at least partially implemented as a computer program module, and when the program is executed by a computer, the function of the corresponding module may be executed.

Fig. 12 schematically shows a block diagram of a new word discovery apparatus according to an embodiment of the present disclosure. The computer system illustrated in FIG. 12 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.

As shown in fig. 12, a computer system 1500 according to an embodiment of the present disclosure includes a processor 1501 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1502 or a program loaded from a storage section 1508 into a Random Access Memory (RAM) 1503.

Processor 1501 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset(s) and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and so forth. The processor 1501 may also include on-board memory for caching purposes. Processor 1501 may include a single processing unit or multiple processing units for performing the different actions of the method flows described with reference to fig. 2-5 in accordance with embodiments of the present disclosure.

In the RAM 1503, various programs and data necessary for the operation of the system 1500 are stored. The processor 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. The processor 1501 executes various operations of the new word discovery method described above with reference to fig. 2 to 5 by executing programs in the ROM 1502 and/or the RAM 1503. Note that the programs may also be stored in one or more memories other than the ROM 1502 and RAM 1503. The processor 1501 may also execute the various operations of the new word discovery method described above with reference to fig. 2-5 by executing programs stored in the one or more memories.

In accordance with an embodiment of the present disclosure, system 1500 may also include an input/output (I/O) interface 1505, input/output (I/O) interface 1505 also connected to bus 1504. The system 1500 may also include one or more of the following components connected to the I/O interface 1505: an input portion 1506 including a keyboard, a mouse, and the like; an output portion 1507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1508 including a hard disk and the like; and a communication section 1509 including a network interface card such as a LAN card, a modem, or the like. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1510 as necessary, so that a computer program read out therefrom is mounted into the storage section 1508 as necessary.

According to an embodiment of the present disclosure, the method described above with reference to the flow chart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1509, and/or installed from the removable medium 1511. The computer program, when executed by the processor 1501, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing. 
According to embodiments of the present disclosure, a computer-readable medium may include the ROM 1502 and/or RAM 1503 described above and/or one or more memories other than the ROM 1502 and RAM 1503.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The above-mentioned computer-readable medium carries one or more programs which, when executed by one of the apparatuses, cause the apparatus to perform a new word discovery method according to an embodiment of the present disclosure. The target table includes at least one record, each record including at least one entry, different entries corresponding to different indices. The method comprises the following steps: acquiring search data of expert users in at least one field, wherein the search data comprises m search terms (m is more than or equal to 1), and the expert users are users whose behaviors and/or levels in the at least one field meet preset conditions; and selecting part of the search words as new word candidate words according to the number of search users and/or the search times of each search word in the m search words in the field to which the search word belongs.

Claims

1. A new word discovery method, comprising:

acquiring search data of expert users in at least one field, wherein the search data comprises m search terms (m is more than or equal to 1), and the expert users are users whose behaviors and/or levels in the at least one field meet preset conditions;

and selecting part of the search words as new word candidate words according to the number of search users and/or the search times of each search word in the m search words in the field to which the search word belongs.

2. The new word discovery method of claim 1, wherein the expert users in the at least one domain include expert users in a first domain, the expert users in the first domain including:

users whose membership level in the first domain is above a preset level threshold; and/or

users whose number of purchased commodities in the first domain is higher than a preset number threshold; and/or

users whose number of times of purchasing commodities in the first domain is higher than a preset number-of-times threshold.

3. The new word discovery method according to claim 1, wherein selecting a part of the m search words as new word candidate words according to the number of search users in the field to which each search word belongs comprises:

acquiring the number of search users of a search word in the field to which it belongs within a period of time, wherein the number of search users is the number of expert users in the field who searched the search word within the period of time; if the number of search users of the search word exceeds a preset search user number threshold, taking the search word as a new word candidate; and/or

acquiring the number of all expert users in the field to which a search word belongs and the number of search users of the search word in the field within a period of time; and if the ratio of the number of search users to the number of all expert users in the field exceeds a preset ratio threshold, taking the search word as a new word candidate.

4. The new word discovery method according to claim 1, further comprising:

obtaining search data rankings of a plurality of users within a preset range, and extracting n search words (n ≥ 1) from the search data rankings, wherein the search data rankings comprise a search user number ranking and/or a search frequency ranking, the search user number ranking ranks the search words according to the number of search users, and the search frequency ranking ranks the search words according to the number of times they are searched;

obtaining the ranking change of each of the n search words in the search user number ranking and/or the search frequency ranking within a period of time, and selecting some of the search words as new word candidates according to the ranking change.

5. The new word discovery method according to claim 4, wherein:

the plurality of users in the preset range comprise a plurality of expert users and a plurality of common users; or

The plurality of users within the preset range comprise a plurality of expert users; or

The plurality of users within the preset range include a plurality of common users.

6. The method for discovering new words according to claim 4, wherein the obtaining of the ranking change condition of each of the n search words in the user number ranking and/or the number of search times ranking within a period of time, and the selecting of a part of the search words as new word candidate words according to the ranking change condition comprises:

obtaining a ranking of each search term before and after the period of time;

obtaining a surge coefficient of each search term according to the ranking of the search terms before and after the period of time;

and taking search terms whose surge coefficient exceeds a preset coefficient threshold as new word candidates.

7. The new word discovery method according to claim 5, wherein the surge coefficient is:

8. The new word discovery method according to claim 1, further comprising:

obtaining a plurality of commodity names, and extracting new word candidates from the plurality of commodity names.

9. The new word discovery method according to claim 8, wherein extracting new word candidates among the plurality of names of commodities includes:

cleaning the plurality of commodity names;

performing word segmentation on the cleaned commodity names to obtain a plurality of segmented words;

acquiring a head set and a tail set of each segmented word;

obtaining a cohesion score of each segmented word according to the head set and the tail set;

and taking each segmented word whose cohesion score exceeds a preset cohesion threshold as a new word candidate.

10. The new word discovery method according to claim 9, wherein

obtaining the cohesion score of each segmented word according to the head set and the tail set comprises:

acquiring a head set cohesion score of the segmented word;

acquiring a tail set cohesion score of the segmented word;

obtaining a brand score of the segmented word according to whether the segmented word contains a brand word and/or the position of the brand word in the segmented word;

wherein the cohesion score of the segmented word comprises the head set cohesion score, the tail set cohesion score and the brand score.
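By way of illustration only, the three components named above could be combined as in the following Python sketch. The claims do not state here how the components are weighted or how the brand score is computed, so the plain sum, the values 1.0 / 0.5 / 0.0, and all function names are assumptions. The head set and tail set cohesion scores are taken as precomputed inputs.

```python
def brand_score(word, brand_words):
    """ASSUMED scoring: 1.0 if the word starts with a brand word,
    0.5 if a brand word appears elsewhere in it, 0.0 otherwise."""
    best = 0.0
    for brand in brand_words:
        if not brand:
            continue  # ignore empty brand entries
        pos = word.find(brand)
        if pos == 0:
            return 1.0
        if pos > 0:
            best = 0.5
    return best


def cohesion_score(head_score, tail_score, word, brand_words):
    # ASSUMED combination: an unweighted sum of the three components.
    return head_score + tail_score + brand_score(word, brand_words)


def select_cohesive_words(scored_words, brand_words, threshold):
    """scored_words: iterable of (word, head_set_score, tail_set_score).

    Returns the words whose combined score exceeds the threshold.
    """
    return [
        word
        for word, head, tail in scored_words
        if cohesion_score(head, tail, word, brand_words) > threshold
    ]
```

Scoring by position lets a segmented word that begins with a brand word outrank one that merely contains it, which is one way to use the position of the brand word; a real system would tune these values.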

11. The new word discovery method according to claim 9, wherein extracting new word candidates from the plurality of commodity names further comprises:

judging whether each segmented word contains a preset service word, and if so, removing the segmented word.
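By way of illustration only, the service word filter could be a simple substring check, as in the following Python sketch. The function name and the example service words are hypothetical; the claims do not list the preset service words.

```python
def remove_service_words(segmented_words, service_words):
    """Drop every segmented word that contains any preset service word."""
    return [
        word
        for word in segmented_words
        if not any(service in word for service in service_words)
    ]


# Hypothetical example: promotional phrases treated as service words.
kept = remove_service_words(
    ["free shipping phone case", "smart band"],
    {"free shipping", "discount"},
)
```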

12. A new word discovery apparatus comprising:

a first acquisition module, configured to acquire search data of expert users in at least one field, wherein the search data comprises m search words (m ≥ 1), and the expert users are users whose behaviors and/or levels in a certain field meet preset conditions;

and a first selection module, configured to select some of the search words as new word candidates according to the number of search users and/or the number of searches of each of the m search words in the field to which the search word belongs.

13. The new word discovery apparatus according to claim 12, wherein the first selection module comprises:

a user number module, configured to acquire the number of search users of a search word in the field to which it belongs within a period of time, wherein the number of search users is the number of expert users in the field who searched the search word within the period of time, and, if the number of search users of the search word exceeds a preset search user number threshold, to take the search word as a new word candidate; and/or

a ratio module, configured to acquire the number of all expert users in the field to which a search word belongs and the number of search users of the search word in that field within a period of time, and, if the ratio of the number of search users to the number of all expert users in the field exceeds a preset ratio threshold, to take the search word as a new word candidate.
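By way of illustration only, the user number test and the ratio test could be implemented together as in the following Python sketch. The function name, the parameter names, and the input layout (a mapping from each search word to the set of expert users in its field who searched it during the period) are assumptions made for the example.

```python
def select_by_expert_searches(searchers_by_word, total_experts,
                              min_users=None, min_ratio=None):
    """Select search words searched by enough expert users of a field.

    searchers_by_word: search word -> set of expert user ids in the field
        who searched that word within the period.
    total_experts: number of all expert users in the field.
    min_users / min_ratio: thresholds; pass None to disable a test.
    A word is kept if it passes any enabled test (the "and/or" case).
    """
    candidates = []
    for word, searchers in searchers_by_word.items():
        n = len(searchers)
        by_count = min_users is not None and n > min_users
        by_ratio = (
            min_ratio is not None
            and total_experts > 0
            and n / total_experts > min_ratio
        )
        if by_count or by_ratio:
            candidates.append(word)
    return candidates
```

The ratio test normalizes by the size of the field, so a word searched by 3 of 10 experts in a small field can qualify even when an absolute count threshold tuned for large fields would reject it.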

14. The new word discovery apparatus according to claim 12, further comprising:

a second acquisition module, configured to acquire search data rankings of a plurality of users within a preset range and extract n search words (n ≥ 1), wherein the search data rankings comprise a search user number ranking and/or a search frequency ranking, the search user number ranking ranks the search words by the number of users who searched them, and the search frequency ranking ranks the search words by the number of times they were searched;

and a second selection module, configured to acquire the ranking change of each of the n search words in the search user number ranking and/or the search frequency ranking over a period of time, and to select some of the search words as new word candidates according to the ranking change.

15. The new word discovery apparatus according to claim 14, further comprising:

a third acquisition module, configured to acquire a plurality of commodity names;

and a third selection module, configured to extract new word candidates from the plurality of commodity names.

16. A new word discovery apparatus comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-11.

17. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 11.