CN111428486A - Article information data processing method, apparatus, medium, and electronic device - Google Patents

Publication number
CN111428486A
Authority
CN
China
Prior art keywords
category
product
product words
word
categories
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910016030.4A
Other languages
Chinese (zh)
Other versions
CN111428486B (en)
Inventor
安旭
安伟佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910016030.4A priority Critical patent/CN111428486B/en
Publication of CN111428486A publication Critical patent/CN111428486A/en
Application granted granted Critical
Publication of CN111428486B publication Critical patent/CN111428486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide an article information data processing method and apparatus, a computer-readable medium, and an electronic device, relating to the technical field of computers. The method comprises: determining the product words of articles according to the word segmentation results of the title sentences of the articles under a category; acquiring the similarity between each product word and its corresponding article; acquiring, for each product word, the sum of its similarities to the different articles under the category; selecting the product words of the category according to the ranking of the sums under the category; and performing category matching according to the product words of the categories so as to determine matching categories. Because the similarity between product words and articles is calculated and product words are selected by summed similarity for category matching, misrecognized product words in title sentences are removed and the accuracy of category mapping is improved.

Description

Article information data processing method, apparatus, medium, and electronic device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for processing article information data, a computer readable medium and electronic equipment.
Background
When articles belong to different categories, category mapping is usually performed before the articles are matched, so as to reduce the scale of matching. Category mapping finds the corresponding categories among the articles of all categories according to a category mapping relation.
In the related art, category mapping is generally judged by intersecting product words. When enough product words overlap between two categories, the articles of the two categories are considered to share certain attributes. For example, if both categories contain product words such as 'white spirit' and 'wine', the articles in both categories relate to 'wine', and the two categories evidently have a category mapping relation.
When selecting the product words that represent a category, the product words in the titles of all articles in the category are first found and sorted by frequency of occurrence, and the M top-ranked product words (those with the smallest sequence numbers) are taken to represent the category. If the top M product words of two categories intersect, the two categories are considered to have a mapping relation.
However, selecting the product words that represent a category requires a dictionary method to identify product words, and this method can misrecognize product words, which reduces the accuracy of category mapping.
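The related-art rule described above can be illustrated with a minimal Python sketch. The function names, the toy titles, and the value M = 2 are hypothetical and are not taken from the source; the sketch only shows how a frequent but misrecognized word such as 'gift bag' can make two unrelated categories appear to map.

```python
from collections import Counter


def top_m_by_frequency(titles_product_words, m):
    """Return the M most frequent product words over all titles of one category."""
    counts = Counter(word for words in titles_product_words for word in words)
    return {word for word, _ in counts.most_common(m)}


def categories_map(words_a, words_b):
    """Related-art rule: two categories map if their representative words intersect."""
    return bool(words_a & words_b)


# Hypothetical toy data: product words found by a dictionary in each title.
liquor_titles = [
    ["white spirit", "gift bag"],
    ["white spirit", "gift bag"],
    ["white spirit", "wine", "gift bag"],
]
tea_titles = [["tea", "gift bag"], ["tea", "gift bag"], ["tea"]]

liquor_words = top_m_by_frequency(liquor_titles, m=2)
tea_words = top_m_by_frequency(tea_titles, m=2)

# "gift bag" is frequent in both categories, so the frequency rule maps
# liquor to tea even though the articles are unrelated.
print(liquor_words & tea_words)
print(categories_map(liquor_words, tea_words))
```

In this toy data the only shared word is the misrecognized one, which is the failure mode the remainder of the description sets out to remove.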
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for processing article information data, a computer-readable medium, and an electronic device, so as to overcome, at least to a certain extent, a technical problem that a product word is mistakenly recognized when selected.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the embodiments of the present invention, there is provided an article information data processing method, including: determining the product words of the articles according to the word segmentation result of the title sentences of the articles under the category; acquiring the similarity between the product words and corresponding articles; acquiring the sum of the similarity of each product word corresponding to different articles under the category; selecting product words of the category according to the sorting of the sum values under the category; and performing category matching processing according to the product words of the categories so as to determine the matching categories.
In one embodiment, the obtaining of the similarity between the product word and the corresponding article includes: training a word vector model by using the title sentence of the article to obtain a word vector of the product word; obtaining a sentence vector of the title sentence of the article; and obtaining the cosine similarity of the word vector and the sentence vector.
In one embodiment, the obtaining a sentence vector of a title sentence of the article includes: and obtaining the sentence vector by adding word vectors of all product words of the title sentence on corresponding dimensions and then taking a mean value.
In one embodiment, before selecting the product words of the category according to the ranking of the sums under the category, the method further comprises: sorting, for each product word under the category, the sum of its similarities to the different articles. Selecting the product words of the category according to the ranking of the sums under the category comprises: selecting the top N product words in the ranking of the sums under the category as the product words of the category, wherein the ratio of N to the number of all product words under the category is a set first value.
In one embodiment, the performing category matching processing according to the product words of the category to determine a matching category includes: and judging whether the categories are matched according to whether the product words of the categories have intersection.
According to a second aspect of the embodiments of the present invention, there is provided an article information data processing apparatus including: the determining unit is used for determining the product words of the articles according to the word segmentation results of the title sentences of the articles under the categories; the first acquisition unit is used for acquiring the similarity between the product words and corresponding articles; the second acquisition unit is used for acquiring the sum of the similarity of different articles corresponding to each product word under the category; the selecting unit is used for selecting the product words of the categories according to the sorting of the sum values under the categories; and the matching unit is used for performing category matching processing according to the product words of the categories so as to determine the matching categories.
In one embodiment, the apparatus further includes a sorting unit configured to sort, for each product word under the category, the sum of its similarities to the different articles; the selecting unit is further configured to select the top N product words in the ranking of the sums under the category as the product words of the category, where the ratio of N to the number of all product words under the category is a set first value.
In one embodiment, the matching unit is further configured to determine whether the categories match according to whether the product words of the categories intersect.
According to a third aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the item information data processing method according to the first aspect of the embodiments described above.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the item information data processing method according to the first aspect of the embodiments.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
In the technical solutions provided by some embodiments of the invention, the similarity between the product words and the articles is calculated, and the product words are selected according to the ranking of the summed similarities for category matching processing, so that misrecognized product words in title sentences are removed and the accuracy of category mapping is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 schematically shows a flowchart of an article information data processing method according to an embodiment of the present invention;
FIG. 2 schematically shows a flowchart of an article information data processing method according to another embodiment of the present invention;
FIG. 3 schematically shows a block diagram of an article information data processing apparatus according to an embodiment of the present invention;
FIG. 4 schematically shows a block diagram of an article information data processing apparatus according to another embodiment of the present invention;
FIG. 5 shows a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In the related art, we employ a dictionary-based word segmentation method to determine product words for an item. If there are multiple product words in the title sentence of an item, it is often not possible to effectively identify which one or ones are the valid product words describing the item. If the product word of the selected article includes a product word irrelevant to the article, since the product word cannot effectively describe the article, a wrong mapping situation may occur during subsequent article category mapping.
For example, the title of a liquor article may contain the description "gift bag", which is misidentified as a product word. In this case the similarity between "gift bag" and the article is low, while the similarity between "white spirit" and the article is high. Taking similarity into account resolves the ambiguity: "white spirit", which is highly similar to the article, is selected as the product word of the article, and the misidentified product word "gift bag" is removed.
Based on the above analysis, in the exemplary embodiment of the present disclosure, the real product words are determined through semantic analysis, and then the product words are used to determine whether the categories corresponding to the product words have mapping relationships. The specific analysis is as follows:
FIG. 1 schematically illustrates an article information data processing method of an exemplary embodiment of the present disclosure. Referring to FIG. 1, the article information data processing method may include the following steps:
Step S102, determining the product words of the articles according to the word segmentation results of the title sentences of the articles under the category.
Step S104, acquiring the similarity between the product words and the corresponding articles.
Step S106, acquiring the sum of the similarities of each product word to the different articles under the category.
Step S108, selecting the product words of the category according to the ranking of the sums under the category.
Step S110, performing category matching processing according to the product words of the categories so as to determine the matching categories.
When the technical scheme is adopted, the product words which are mistakenly identified are removed according to the similarity between the product words and the articles, so that the real and effective product words are obtained, the accuracy rate of article category mapping can be improved, and the existing commodity category mapping scheme is optimized.
Specifically, the product words are processed by using a semantic recognition method in the scheme, and a category mapping method in which only product words are intersected in the related technology is improved.
Before step S102, the title sentence of the article needs to be segmented into words. Specifically, the title sentence is segmented according to a dictionary in the word segmentation system and a set matching algorithm to obtain a word segmentation result. In step S102, part-of-speech tagging is performed on the word segmentation result to find the product words of the article in the title sentence.
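As a minimal sketch of dictionary-based segmentation followed by tag-based selection of product words, the Python example below uses forward maximum matching. The source does not name the matching algorithm, so forward maximum matching is an assumption, and the lexicon, its tag names, and the sample title are hypothetical.

```python
# Hypothetical lexicon mapping each dictionary word to a part-of-speech tag.
LEXICON = {
    "精装": "modifier",   # "deluxe packaging"
    "白酒": "product",    # "white spirit"
    "礼品袋": "product",  # "gift bag"
}


def forward_max_match(title, dictionary, max_len=4):
    """Segment a title by repeatedly taking the longest dictionary word
    that starts at the current position (assumed matching algorithm)."""
    tokens = []
    i = 0
    while i < len(title):
        for size in range(min(max_len, len(title) - i), 0, -1):
            candidate = title[i:i + size]
            # Unknown single characters fall through as one-character tokens.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def product_words(tokens, dictionary, product_tag="product"):
    """Keep only the tokens whose dictionary tag marks them as product words."""
    return [t for t in tokens if dictionary.get(t) == product_tag]


tokens = forward_max_match("精装白酒礼品袋", LEXICON)
print(tokens)
print(product_words(tokens, LEXICON))
```

Both "白酒" and "礼品袋" come out as product words here, which is exactly the ambiguity that the similarity computation in the following steps is meant to resolve.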
In step S104, how well each product word describes its article is examined by acquiring the similarity between the product word and the corresponding article.
Specifically, when the similarity between a product word and a corresponding article is obtained, a word vector and a sentence vector of a title sentence of the article need to be used.
The word vector model is trained by using the title sentence of the article, and the word vector of the product word can be obtained. Here, the word vector model may be a word2vec model.
The word2vec model is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions; under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training is completed, the word2vec model can be used to map each word to a vector that can represent word-to-word relationships; the vector is taken from the hidden layer of the neural network.
When a sentence vector of a title sentence of an article is obtained, the sentence vector is calculated by adding word vectors of product words of the title sentence in corresponding dimensions and then taking an average value.
Here, the formula employed is:
$$S = \frac{1}{n}\sum_{i=1}^{n} W_i$$
where $n$ is the number of product words in the title, $W_i$ is the word vector corresponding to the $i$-th product word, and $S$ is the sentence vector.
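The averaging described above can be sketched in a few lines of Python. The function name and the two-dimensional toy vectors are hypothetical; real word vectors would come from the trained word vector model.

```python
def sentence_vector(word_vectors):
    """Average the word vectors of a title's product words dimension by dimension."""
    n = len(word_vectors)
    if n == 0:
        raise ValueError("a title needs at least one product word")
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / n for d in range(dim)]


# Two hypothetical 2-dimensional word vectors.
print(sentence_vector([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```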
The similarity between the product words and the corresponding articles can be represented by the cosine similarity between the word vector of the product words of the articles and the sentence vector of the title sentence. Therefore, when the similarity between the product words and the corresponding articles is obtained, the cosine similarity between the word vectors and the sentence vectors can be obtained.
The cosine similarity calculation formula is as follows:
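$$\cos(A, B) = \frac{A \cdot B}{|A| \times |B|}$$
where $A$ and $B$ are vectors; here $A$ is the word vector of a product word and $B$ is the sentence vector of the title sentence.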
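A minimal Python sketch of the cosine similarity follows. The toy vectors are hypothetical and chosen so that 'white spirit' lies close to the title vector while 'gift bag' does not; returning 0.0 for a zero vector is an assumption, since the source does not discuss that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Hypothetical word vectors for the product words of one liquor title.
white_spirit = [0.9, 0.1]
wine = [0.8, 0.2]
gift_bag = [0.1, 0.9]
title_vector = [0.6, 0.4]  # dimension-wise mean of the three vectors above

sim_spirit = cosine_similarity(white_spirit, title_vector)
sim_bag = cosine_similarity(gift_bag, title_vector)
print(sim_spirit > sim_bag)  # True
```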
In step S106, the sum of the similarities between a product word and all of its corresponding articles is obtained by using the following formula:
$$\mathrm{sum\_sim}_{p\_word} = \sum_{j=1}^{N} \mathrm{sim}^{(j)}_{p\_word}$$
where $p\_word$ denotes the product word, $\mathrm{sum\_sim}_{p\_word}$ is the sum of the similarities between the product word $p\_word$ and its corresponding commodities under the category, $N$ is the number of commodities under the category, and $\mathrm{sim}^{(j)}_{p\_word}$ is the similarity between the product word $p\_word$ in the $j$-th commodity under the category and that commodity.
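The summation in step S106 can be sketched as follows. The data layout (one dictionary of product word to similarity per commodity) and the numbers are hypothetical; a product word contributes only for the commodities whose titles contain it.

```python
from collections import defaultdict


def sum_similarity(item_similarities):
    """Add up, per product word, its similarity to every commodity it appears in."""
    totals = defaultdict(float)
    for sims in item_similarities:
        for word, sim in sims.items():
            totals[word] += sim
    return dict(totals)


# Hypothetical per-commodity similarities within one category.
category = [
    {"white spirit": 0.9, "gift bag": 0.3},
    {"white spirit": 0.8},
    {"wine": 0.7, "gift bag": 0.2},
]
totals = sum_similarity(category)
print(max(totals, key=totals.get))  # white spirit
```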
Before step S108, the sum of the similarity of different articles corresponding to each product word under the category needs to be sorted.
In step S108, when the product words of the category are selected according to the ranking of the sums under the category, the top N product words in that ranking are selected as the product words of the category, where the ratio of N to the number of all product words under the category is a set first value.
Here, the first value K (0 < K < 1) may be, for example, K = 0.70. When the N top-ranked product words (those with the smallest sequence numbers) are selected, the ratio of N to the total number of product words is 0.70, where N is a natural number greater than 1. That is, if there are 10 product words under a category and K is 0.70, then N is 7, and the selected product words are the 7 top-ranked product words. These 7 product words are the product words of the category.
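The selection in step S108 can be sketched as below. Several details are assumptions not stated in the source: the ranking is taken to be in descending order of the summed similarity, ties are broken alphabetically, and N is obtained by rounding K times the number of product words. The function name and the toy sums are hypothetical.

```python
def select_category_words(sum_sim, k=0.70):
    """Return the top N product words by summed similarity, with N / total close to k."""
    if not 0 < k < 1:
        raise ValueError("k must satisfy 0 < k < 1")
    # Assumption: higher sums rank first; ties are broken alphabetically.
    ranked = sorted(sum_sim, key=lambda word: (-sum_sim[word], word))
    n = round(k * len(ranked))  # assumption: round to the nearest integer
    return ranked[:n]


# Ten hypothetical product words whose sums decrease from w0 to w9.
sums = {f"w{i}": 10.0 - i for i in range(10)}
selected = select_category_words(sums, k=0.70)
print(len(selected))  # 7
```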
In exemplary embodiments of the present disclosure, whether two categories match is determined according to whether the product words of the categories intersect. Specifically, if the product words of two categories intersect, the two categories are considered to match, that is, a mapping relation exists between them. The value of K may be chosen by grid search to obtain a better result.
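The intersection test can be sketched for a set of categories as follows. The category names, their product words, and the choice to compare every pair of categories are hypothetical; the source only states that two categories match when their product words intersect.

```python
from itertools import combinations


def matching_categories(category_words):
    """Return (category_a, category_b, shared_words) for every pair that intersects."""
    matches = []
    for a, b in combinations(sorted(category_words), 2):
        shared = set(category_words[a]) & set(category_words[b])
        if shared:
            matches.append((a, b, shared))
    return matches


# Hypothetical selected product words per category.
selected = {
    "liquor": {"white spirit", "wine"},
    "alcoholic drinks": {"wine", "beer"},
    "tea": {"tea", "green tea"},
}
print(matching_categories(selected))
```

Only the pair sharing 'wine' is reported; 'tea' shares no product word with the other categories and therefore has no mapping.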
In practical applications, the article may be a commodity on a commercial web page. Commercial web pages carry a large number of commodities of different kinds and categories. When matching these commodities, the categories of the commodities are generally mapped first, and the commodities are then matched according to the mapping result, so as to reduce the amount of data to be processed.
In the exemplary embodiment of the present disclosure, an article information data processing method is provided to implement commodity information mapping processing on commodities of different categories. Specifically, as shown in fig. 2, the article information data processing method according to the embodiment of the present invention includes steps S201 to S208, where:
Step S201, segmenting the title sentence of each item into words.
Step S202, performing part-of-speech tagging on the word segmentation result.
Step S203, training a word vector model.
Step S204, calculating the sentence vector of the title sentence.
Step S205, calculating the cosine similarity between the word vector and the sentence vector.
Step S206, calculating the sum of the similarities.
Step S207, sorting the sums of the similarities.
Step S208, taking categories whose product words intersect as mapped to each other.
In step S201, the title sentence is segmented according to a dictionary in the word segmentation system and a set matching algorithm to obtain a word segmentation result. In step S202, part-of-speech tagging is performed on the segmented words, and the product words of the title sentence are selected according to the tagging result. In step S203, the word vector model is trained using the title sentences of the commodities, and the word vectors of the product words are obtained. In step S204, the sentence vector is calculated by adding the word vectors of the product words of the title sentence in corresponding dimensions and then averaging. In step S205, the cosine similarity between the word vector of each product word of the commodity and the sentence vector of the title sentence is calculated. In step S206, the sum of the similarities between each product word and all of its corresponding commodities is calculated. In step S207, the sums of the similarities of the product words under the category are sorted. In step S208, the most representative product words are selected according to the sorting result to represent the category, and the product word intersection judgment is performed; when the product words of two categories intersect, the two categories are determined to be mapped to each other.
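Steps S204 to S208 can be tied together in a short Python sketch. Segmentation, tagging, and word vector training (S201 to S203) are assumed to have already produced the product words of each title and a vector for each product word; all vectors, titles, and names below are hypothetical, and descending ranking with rounding of K times the word count is an assumption.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Dimension-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def category_product_words(titles, vectors, k=0.70):
    """Select the representative product words of one category."""
    sums = defaultdict(float)
    for words in titles:
        sentence = mean_vector([vectors[w] for w in words])  # S204
        for w in words:
            sums[w] += cosine(vectors[w], sentence)          # S205, S206
    ranked = sorted(sums, key=lambda w: (-sums[w], w))       # S207 (assumed descending)
    return set(ranked[:round(k * len(ranked))])


# Hypothetical 3-dimensional word vectors standing in for a trained model.
VECTORS = {
    "white spirit": [0.9, 0.1, 0.0],
    "wine": [0.8, 0.2, 0.0],
    "tea": [0.0, 0.1, 0.9],
    "green tea": [0.1, 0.0, 0.9],
    "gift bag": [0.1, 0.9, 0.1],
}
LIQUOR = [["white spirit", "wine", "gift bag"], ["white spirit", "wine"],
          ["white spirit", "wine", "gift bag"]]
TEA = [["tea", "green tea", "gift bag"], ["tea", "green tea"], ["tea", "gift bag"]]

liquor_words = category_product_words(LIQUOR, VECTORS)
tea_words = category_product_words(TEA, VECTORS)
print(sorted(liquor_words), sorted(tea_words))
print(bool(liquor_words & tea_words))  # S208: do the two categories map?
```

With these toy vectors 'gift bag' has the lowest summed similarity in both categories and is dropped, so the liquor and tea categories no longer intersect.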
According to the article information data processing method provided by the embodiment of the invention, the similarity between the product words and the articles is calculated, and the product words are selected according to the ranking of the summed similarities for category matching processing, so that misrecognized product words in title sentences are removed and the accuracy of category mapping is improved.
The following describes embodiments of the apparatus of the present invention, which can be used to execute the above-mentioned article information data processing method of the present invention. As shown in fig. 3, an article information data processing apparatus 300 according to an embodiment of the present invention includes:
a determining unit 302, configured to determine the product words of the articles according to the word segmentation results of the title sentences of the articles under the category;
a first obtaining unit 304, configured to obtain the similarity between the product words and the corresponding articles;
a second obtaining unit 306, configured to obtain the sum of the similarities of each product word to the different articles under the category;
a selecting unit 308, configured to select the product words of the category according to the ranking of the sums under the category; and
a matching unit 310, configured to perform category matching processing according to the product words of the categories so as to determine the matching categories.
When the technical scheme is adopted, the product words which are mistakenly identified are removed according to the similarity between the product words and the articles, so that the real and effective product words are obtained, the accuracy rate of article category mapping can be improved, and the existing commodity category mapping scheme is optimized.
Specifically, the product words are processed by using a semantic recognition method in the scheme, and a category mapping method in which only product words are intersected in the related technology is improved.
Before the determining unit 302 determines the product words of the article, the title sentence of the article needs to be segmented into words. Specifically, the title sentence is segmented according to a dictionary in the word segmentation system and a set matching algorithm to obtain a word segmentation result. The determining unit 302 performs part-of-speech tagging on the word segmentation result to find the product words of the article in the title sentence.
The first obtaining unit 304 examines how well each product word describes its article by obtaining the similarity between the product word and the corresponding article.
The first obtaining unit 304 trains a word vector model using the title sentences of the articles to obtain the word vectors of the product words. The first obtaining unit 304 further calculates a sentence vector by adding the word vectors of the product words of the title sentence in corresponding dimensions and then averaging.
After the word vectors and the sentence vectors are obtained, the first obtaining unit 304 calculates cosine similarity between the word vectors and the sentence vectors, and the cosine similarity can represent similarity between product words and corresponding articles.
The second obtaining unit 306 obtains the sum of the similarities between a product word and all of its corresponding articles by using the following formula:
$$\mathrm{sum\_sim}_{p\_word} = \sum_{j=1}^{N} \mathrm{sim}^{(j)}_{p\_word}$$
where $p\_word$ denotes the product word, $\mathrm{sum\_sim}_{p\_word}$ is the sum of the similarities between the product word $p\_word$ and its corresponding commodities under the category, $N$ is the number of commodities under the category, and $\mathrm{sim}^{(j)}_{p\_word}$ is the similarity between the product word $p\_word$ in the $j$-th commodity under the category and that commodity.
The selecting unit 308 selects the top N product words in the ranking of the sums under the category as the product words of the category, where the ratio of N to the number of all product words under the category is a set first value.
Here, the first value K (0 < K < 1) may be, for example, K = 0.70. When the N top-ranked product words (those with the smallest sequence numbers) are selected, the ratio of N to the total number of product words is 0.70, where N is a natural number greater than 1. That is, if there are 10 product words under a category and K is 0.70, then N is 7, and the selected product words are the 7 top-ranked product words. These 7 product words are the product words of the category.
In an exemplary embodiment of the present disclosure, the matching unit 310 determines whether two categories match according to whether the product words of the categories intersect. Specifically, if the product words of two categories intersect, the two categories are considered to match, that is, a mapping relation exists between them. The value of K may be chosen by grid search to obtain a better result.
According to an exemplary embodiment of the present disclosure, referring to fig. 4, compared to the item information data processing apparatus 300, the item information data processing apparatus 400 includes not only the determining unit 302, the first obtaining unit 304, the second obtaining unit 306, the selecting unit 308, and the matching unit 310, but also the sorting unit 402.
Specifically, the sorting unit 402 is configured to sort the sum of the similarity of different items corresponding to each product word under the category.
According to the article information data processing apparatus provided by the embodiment of the invention, the similarity between the product words and the articles is calculated, and the product words are selected according to the ranking of the summed similarities for category matching processing, so that misrecognized product words in title sentences are removed and the accuracy of category mapping is improved.
Referring now to FIG. 5, shown is a block diagram of a computer system 700 suitable for use with the electronic device implementing an embodiment of the present invention. The computer system 700 of the electronic device shown in fig. 5 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for system operation are also stored. The CPU701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, and the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as necessary. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program executes the above-described functions defined in the system of the present application when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or by hardware, and the described units may also be disposed in a processor. In some cases, the names of these units do not constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs, which when executed by the electronic device, cause the electronic device to implement the item information data processing method as described in the above embodiments.
For example, the electronic device may implement the steps shown in FIG. 1: step S102, determining product words of the articles according to word segmentation results of title sentences of the articles under the category; step S104, obtaining the similarity between the product words and the corresponding articles; step S106, obtaining, for each product word, the sum of its similarities to the different articles under the category; step S108, selecting product words of the category according to the sorting of the sum values under the category; and step S110, performing category matching processing according to the product words of the categories so as to determine the matching categories.
As another example, the electronic device may implement the steps shown in FIG. 2.
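Steps S106 to S110 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: the function names, the toy similarity values, the rounding of N with a ceiling, and the alphabetical tie-break are all assumptions made for illustration. The intersection test used for step S110 follows the matching rule stated in claim 5, and the per-article similarities from step S104 are taken as given input.

```python
import math
from collections import defaultdict


def select_category_words(item_similarities, ratio):
    """Pick the product words of one category (steps S106 and S108).

    item_similarities: one dict per article in the category, mapping each
    product word of that article to its similarity with the article.
    ratio: assumed fraction of the category's product words to keep.
    """
    totals = defaultdict(float)
    for sims in item_similarities:
        for word, sim in sims.items():
            totals[word] += sim  # S106: sum over the articles of the category
    # S108: rank by descending sum; ties broken alphabetically (assumption)
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    if not ranked:
        return set()
    n = max(1, math.ceil(ratio * len(ranked)))  # rounding rule is an assumption
    return set(ranked[:n])


def categories_match(words_a, words_b):
    """S110: treat two categories as matching if their product words intersect."""
    return bool(words_a & words_b)


# Hypothetical toy data: similarity of each product word to its article.
phones_a = [{"phone": 0.9, "case": 0.3}, {"phone": 0.8, "film": 0.2},
            {"phone": 0.85, "case": 0.25}]
phones_b = [{"phone": 0.88, "charger": 0.4}, {"phone": 0.9, "cable": 0.1}]
shirts = [{"shirt": 0.9, "phone": 0.05}, {"shirt": 0.8, "collar": 0.3}]

words_a = select_category_words(phones_a, ratio=0.3)
words_b = select_category_words(phones_b, ratio=0.3)
words_c = select_category_words(shirts, ratio=0.3)
print(words_a, words_b, words_c)
print(categories_match(words_a, words_b), categories_match(words_a, words_c))
```

In this toy data the word "phone" also appears in one shirt title, but its summed similarity in that category is low, so it is not selected there and the shirt category does not match the phone categories.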
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that enable a computing device (such as a personal computer, a server, a touch terminal, or a network device) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. An article information data processing method, characterized by comprising:
determining the product words of the articles according to the word segmentation result of the title sentences of the articles under the category;
acquiring the similarity between the product words and corresponding articles;
acquiring the sum of the similarity of each product word corresponding to different articles under the category;
selecting product words of the category according to the sorting of the sum values under the category;
and performing category matching processing according to the product words of the categories so as to determine the matching categories.
2. The method of claim 1, wherein the obtaining the similarity of the product word to the corresponding item comprises:
training a word vector model by using the title sentence of the article to obtain a word vector of the product word;
obtaining a sentence vector of the title sentence of the article;
and obtaining the cosine similarity of the word vector and the sentence vector.
3. The method of claim 2, wherein said obtaining a sentence vector of a title sentence of the item comprises:
and obtaining the sentence vector by adding word vectors of all product words of the title sentence on corresponding dimensions and then taking a mean value.
4. The method of claim 3, wherein prior to selecting a product word according to the sorting of the sum values under categories, the method further comprises:
sorting the sum of the similarity of different articles corresponding to each product word under the category;
selecting the product words of the category according to the sorting of the sum values under the category comprises the following steps:
and selecting the top N product words in the sorting of the sum values under the category as the product words of the category, wherein the ratio of N to the number of all the product words under the category is a set first numerical value.
5. The method of claim 4, wherein performing a category matching process based on the product words of the category to determine a matching category comprises:
and judging whether the categories are matched according to whether the product words of the categories have intersection.
6. An article information data processing apparatus characterized by comprising:
the determining unit is used for determining the product words of the articles according to the word segmentation results of the title sentences of the articles under the categories;
the first acquisition unit is used for acquiring the similarity between the product words and corresponding articles;
the second acquisition unit is used for acquiring the sum of the similarity of different articles corresponding to each product word under the category;
the selecting unit is used for selecting the product words of the categories according to the sorting of the sum values under the categories;
and the matching unit is used for performing category matching processing according to the product words of the categories so as to determine the matching categories.
7. The apparatus according to claim 6, further comprising a sorting unit configured to sort a sum of similarity values of different items corresponding to each of the product words under the category;
the selecting unit is further configured to select the top N product words in the sorting of the sum values under the category as the product words of the category, wherein the ratio of N to the number of all the product words under the category is a set first numerical value.
8. The apparatus of claim 7, wherein the matching unit is further configured to determine whether the categories match according to whether product words of the categories intersect.
9. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the item information data processing method according to any one of claims 1 to 5.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the item information data processing method according to any one of claims 1 to 5.
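The similarity computation recited in claims 2 and 3 can be sketched in Python as follows. This is an illustrative sketch only: the function names and the two-dimensional toy vectors are hypothetical, and the training of the word vector model on title sentences is omitted, so the word vectors are assumed to be already available.

```python
import math


def mean_vector(vectors):
    """Add the vectors dimension by dimension, then take the mean (claim 3)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_item_similarities(product_words, word_vectors):
    """Similarity of each product word to its article (claim 2).

    The article is represented by the sentence vector of its title, taken
    here as the mean of the word vectors of the title's product words.
    """
    if not product_words:
        return {}
    sentence = mean_vector([word_vectors[w] for w in product_words])
    return {w: cosine(word_vectors[w], sentence) for w in product_words}


# Hypothetical toy word vectors; a real system would obtain them from a
# word vector model trained on the title sentences.
toy_vectors = {"phone": [1.0, 0.0], "handset": [0.9, 0.1], "shirt": [0.0, 1.0]}
sims = word_item_similarities(["phone", "handset", "shirt"], toy_vectors)
print(sims)
```

With these toy vectors, "shirt" points away from the other two product words of the title, so its similarity to the sentence vector is lower than that of "phone" or "handset".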
CN201910016030.4A 2019-01-08 2019-01-08 Article information data processing method, device, medium and electronic equipment Active CN111428486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910016030.4A CN111428486B (en) 2019-01-08 2019-01-08 Article information data processing method, device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111428486A (en) 2020-07-17
CN111428486B (en) 2023-06-23

Family

ID=71545950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910016030.4A Active CN111428486B (en) 2019-01-08 2019-01-08 Article information data processing method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111428486B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329653A (en) * 2020-11-09 2021-02-05 北京沃东天骏信息技术有限公司 Data processing method, device, computer system and readable storage medium
CN114529337A (en) * 2022-02-08 2022-05-24 北京电解智科技有限公司 Information detection method and device
WO2023202170A1 (en) * 2022-04-21 2023-10-26 北京沃东天骏信息技术有限公司 Product word disambiguation method and apparatus

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965905A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Web page classifying method and apparatus
JP2016206487A (en) * 2015-04-24 2016-12-08 日本電信電話株式会社 Voice recognition result shaping device, method and program
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
US20170169008A1 (en) * 2015-12-15 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for sentiment classification
CN107168992A (en) * 2017-03-29 2017-09-15 北京百度网讯科技有限公司 Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN107657284A (en) * 2017-10-11 2018-02-02 宁波爱信诺航天信息有限公司 A kind of trade name sorting technique and system based on Semantic Similarity extension
CN107729917A (en) * 2017-09-14 2018-02-23 北京奇艺世纪科技有限公司 The sorting technique and device of a kind of title
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN108763333A (en) * 2018-05-11 2018-11-06 北京航空航天大学 A kind of event collection of illustrative plates construction method based on Social Media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SADRZADEH, MEHRNOOSH: "Sentence entailment in compositional distributional semantics" *
邱宁佳: "结合改进主动学习的SVD-CNN弹幕文本分类算法" *

Also Published As

Publication number Publication date
CN111428486B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN112395487B (en) Information recommendation method and device, computer readable storage medium and electronic equipment
CN111046154A (en) Information retrieval method, information retrieval device, information retrieval medium and electronic equipment
CN111428486B (en) Article information data processing method, device, medium and electronic equipment
CN111797622B (en) Method and device for generating attribute information
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111274822A (en) Semantic matching method, device, equipment and storage medium
CN113051380B (en) Information generation method, device, electronic equipment and storage medium
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN113988157A (en) Semantic retrieval network training method and device, electronic equipment and storage medium
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN116644148A (en) Keyword recognition method and device, electronic equipment and storage medium
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
CN116383382A (en) Sensitive information identification method and device, electronic equipment and storage medium
CN111507098B (en) Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN114138976A (en) Data processing and model training method and device, electronic equipment and storage medium
CN113139056A (en) Network data clustering method, clustering device, electronic device and medium
CN111708862A (en) Text matching method and device and electronic equipment
CN113139382A (en) Named entity identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant