CN111563117B - Structured information display method and device, electronic equipment and computer readable medium - Google Patents


Info

Publication number
CN111563117B
Authority
CN
China
Prior art keywords
word
information
word vector
cluster center
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010671714.0A
Other languages
Chinese (zh)
Other versions
CN111563117A (en
Inventor
韩韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hongli Intellectual Property Service Co ltd
Original Assignee
Beijing Missfresh Ecommerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Missfresh Ecommerce Co Ltd filed Critical Beijing Missfresh Ecommerce Co Ltd
Priority to CN202010671714.0A priority Critical patent/CN111563117B/en
Publication of CN111563117A publication Critical patent/CN111563117A/en
Application granted granted Critical
Publication of CN111563117B publication Critical patent/CN111563117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the present disclosure disclose a structured information display method, a structured information display device, an electronic device and a computer readable medium. One embodiment of the method comprises: vectorizing the pre-labeled item name to generate a word vector set; determining relationship information among the word vectors in the word vector set to obtain a relationship information set; serializing each piece of relationship information in the relationship information set to generate serialized information, obtaining a serialized information set; labeling each piece of serialized information in the serialized information set to generate labeled serialized information, obtaining a labeled serialized information set; and generating and displaying the structured information related to the item name according to the labeled serialized information set. This embodiment improves the efficiency and accuracy of generating structured information and the accuracy of displaying it, thereby improving the user experience.

Description

Structured information display method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a structured information display method, a structured information display device, electronic equipment and a computer readable medium.
Background
The structured information may be a plurality of interrelated information groups into which an item name is parsed. A common current method of generating and displaying structured information is to match the item names in an item library against mined dictionaries of various kinds of structured information, and then to generate and display the corresponding structured information. This method labels item names containing polysemous words with low accuracy. In addition, for item names newly added to the item library, the mined structured information dictionaries contain no corresponding entries, so the structured information cannot be labeled well and its generation and display are not accurate enough.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a structured information display method, apparatus, electronic device and computer readable medium to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a structured information display method, including: vectorizing the pre-labeled item name to generate a word vector set; determining the relationship information among the word vectors in the word vector set to obtain a relationship information set; serializing each piece of relationship information in the relationship information set to generate serialized information, obtaining a serialized information set; labeling each piece of serialized information in the serialized information set to generate labeled serialized information, obtaining a labeled serialized information set; and generating and displaying the structured information related to the item name according to the labeled serialized information set.
In a second aspect, some embodiments of the present disclosure provide a structured information display apparatus, the apparatus comprising: a first determining unit configured to vectorize the pre-labeled item name to generate a word vector set; a second determining unit configured to determine relationship information among the word vectors in the word vector set, resulting in a relationship information set; a third determining unit configured to serialize each piece of relationship information in the relationship information set to generate serialized information, resulting in a serialized information set; a fourth determining unit configured to label each piece of serialized information in the serialized information set to generate labeled serialized information, resulting in a labeled serialized information set; and a generating and displaying unit configured to generate and display the structured information related to the item name according to the labeled serialized information set.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement the method as described in the first aspect.
In a fourth aspect, some embodiments of the disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method as described in the first aspect.
One of the above-described embodiments of the present disclosure has the following advantageous effects. A word vector set is generated by vectorizing the pre-labeled item name. The relationship information among the word vectors in the word vector set is then determined to obtain a relationship information set. The relationship information in this set improves labeling accuracy, which addresses the low accuracy of labeling polysemous words contained in item names. Each piece of relationship information in the relationship information set is serialized to generate serialized information, obtaining a serialized information set. Each piece of serialized information in the serialized information set is then labeled to generate labeled serialized information, obtaining a labeled serialized information set. Finally, the structured information related to the item name is generated and displayed according to the labeled serialized information set. The structured information corresponding to an item name can therefore be actively generated and displayed from the generated labeling information and the words obtained by segmenting the item name. This solves the problem that, for item names newly added to the item library, the mined structured information dictionaries contain no corresponding entries, so that the structured information cannot be labeled well and its generation and display are not accurate enough.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of one application scenario of a structured information display method in accordance with some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a structured information display method according to the present disclosure;
FIG. 3 is a flow diagram of further embodiments of a structured information display method according to the present disclosure;
FIG. 4 is a schematic structural diagram of some embodiments of a structured information display apparatus according to the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings. The embodiments and the features of the embodiments in the present disclosure may be combined with each other where no conflict arises.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a structured information display method according to some embodiments of the present disclosure.
In the application scenario of fig. 1, first, the computing device 101 may generate a word vector set 103 from the pre-labeled item name 102. The computing device 101 may then derive a relationship information set 104 from the generated word vector set 103. Then, the computing device 101 can generate a serialized information set 105 based on the relationship information set 104. Next, the computing device 101 can generate a labeled serialized information set 106 from the serialized information set 105. Finally, the computing device 101 can generate and display structured information 107 for the item name from the labeled serialized information set 106.
The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or may be implemented as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices enumerated above. It may be implemented, for example, as multiple pieces of software and software modules used to provide distributed services, or as a single piece of software or software module. No particular limitation is imposed here.
It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.
With continued reference to FIG. 2, a flow 200 of some embodiments of a structured information display method according to the present disclosure is shown. The method may be performed by the computing device 101 of fig. 1. The structured information display method comprises the following steps:
Step 201, performing vectorization representation on the pre-labeled item name to generate a word vector set.
In some embodiments, an executing body of the structured information display method (such as the computing device 101 shown in fig. 1) may perform vectorization representation on the pre-labeled item name, and generate a word vector set. The pre-labeled item name can be obtained by pre-labeling the item name through the mined category and then performing data cleaning on the pre-labeled data.
As an example, the item name may be "Red Fuji apple of Shandong Yantai", and the mined categories may be "fruit" and "mobile phone brand". Pre-labeling the item name yields "Red Fuji apple of Shandong Yantai-fruit" and "Red Fuji apple of Shandong Yantai-mobile phone brand". Data cleaning of the labeled item names finds that the label "Red Fuji apple of Shandong Yantai-mobile phone brand" is wrong and the label "Red Fuji apple of Shandong Yantai-fruit" is correct. The pre-labeled item name corresponding to "Red Fuji apple of Shandong Yantai" is thereby obtained.
In addition, vectorizing the pre-labeled item name converts text data into vectors, which facilitates computer processing. The vectors also serve as input data from which the relationships between word vectors can be obtained in the next step. The vectorization may comprise the following steps:
First, performing word segmentation processing on the pre-labeled item name.
As an example, the pre-labeled item name may be "Red Fuji apple of Shandong Yantai-fruit", which is segmented to obtain "[Shandong, Yantai, of, Red Fuji, apple]-fruit".
Second, performing one-hot encoding on the data obtained by the word segmentation.
As an example, a vectorized representation is made of "[Shandong, Yantai, of, Red Fuji, apple]-fruit". The word vectors obtained after the one-hot encoding may include: { "Shandong": [01000000000], "Yantai": [00010000000], "of": [00000001000], "Red Fuji": [00000010000], "apple": [00000000010], "fruit": [00000000001] }. The word vector set of the vectorized pre-labeled item name is thereby obtained.
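The one-hot encoding step can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `one_hot_encode` is hypothetical, the input is assumed to be already segmented (the segmentation tool is not specified in the text), and the vocabulary is assumed to consist only of the tokens of this one item name, so the vectors have 6 dimensions rather than the 11 dimensions of the example above. The English token names are illustrative renderings.

```python
def one_hot_encode(words):
    """Map each distinct word to a one-hot vector over the vocabulary.

    Assumption: the vocabulary is just the distinct input words, in
    first-seen order. A real system would use a larger, shared vocabulary.
    """
    vocabulary = list(dict.fromkeys(words))  # de-duplicate, keep order
    size = len(vocabulary)
    return {
        word: [1 if position == index else 0 for position in range(size)]
        for index, word in enumerate(vocabulary)
    }


# Tokens of the segmented, pre-labeled item name plus its label.
tokens = ["Shandong", "Yantai", "of", "Red Fuji", "apple", "fruit"]
vectors = one_hot_encode(tokens)
```

Each vector contains exactly one 1, at the index of its word in the vocabulary.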
Step 202, determining the relationship information among the word vectors in the word vector set to obtain a relationship information set.
In some embodiments, the execution subject determines the relationships among the word vectors in the word vector set, resulting in a relationship information set. The relationship information refers to association degree information between word vectors. The association degree information can be calculated from the values of the word vectors. As an example, the calculation of the association degree information may include the following steps:
First, each dimension of the word vector is inverted.
As an example, the word vector [01000000000] corresponding to the above-mentioned "Shandong" is inverted to obtain a new word vector [10111111111]. The word vector corresponding to "Yantai" is inverted to obtain a new word vector [11101111111].
Second, a first relation value between the word vectors is calculated by a first formula (shown only as an image in the original publication), whose symbols, written here as $\cos\theta$, $i$, $n$, $x_i$ and $y_i$, have the following meanings:
$\cos\theta$ represents the cosine value of the two word vectors;
$i$ represents the dimension index of the word vector, an integer in the range $[1, n]$;
$n$ represents the length of each word vector;
$x_i$ represents the value of the $i$-th dimension of the first word vector, which is 0 or 1;
$y_i$ represents the value of the $i$-th dimension of the second word vector, which is 0 or 1.
By way of example, to calculate the first relation value for "Shandong" and "Yantai", substituting the inverted word vector for "Shandong" and the inverted word vector for "Yantai" into the first formula above may yield a cosine value of 0.09.
Third, based on the first relation value, the association degree information between the word vectors is obtained by an association degree formula (shown only as an image in the original publication), in which one symbol represents the association degree of the two word vectors and the other represents the cosine value between the two word vectors.
As an example, substituting the cosine value of 0.09, obtained by the first formula for the inverted word vectors corresponding to "Shandong" and "Yantai", into the association degree formula gives an association degree of 0.91 between the two words "Shandong" and "Yantai".
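The three steps above can be sketched in Python under stated assumptions. The formula images are not available in this text, so the sketch substitutes the standard cosine similarity for the first formula and assumes that the association degree is one minus the first relation value, which is consistent with the worked example (0.09 gives 0.91). The function names `invert`, `cosine` and `association_degree` are hypothetical.

```python
import math


def invert(vector):
    """Flip every dimension of a 0/1 word vector."""
    return [1 - bit for bit in vector]


def cosine(first, second):
    """Standard cosine similarity (an assumed stand-in for the first formula).

    Not guarded against all-zero vectors, which cannot arise from inverting
    one-hot vectors longer than one dimension.
    """
    dot = sum(x * y for x, y in zip(first, second))
    norm_first = math.sqrt(sum(x * x for x in first))
    norm_second = math.sqrt(sum(y * y for y in second))
    return dot / (norm_first * norm_second)


def association_degree(first_relation_value):
    """Assumption: degree = 1 - first relation value (0.09 -> 0.91)."""
    return 1.0 - first_relation_value


shandong = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
yantai = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
standard_cosine = cosine(invert(shandong), invert(yantai))
example_degree = association_degree(0.09)
```

With the standard cosine, the two inverted vectors give 0.9 rather than the 0.09 of the worked example, so the exact first formula evidently differs from this stand-in; only the inversion step and the final subtraction are meant to mirror the text.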
Step 203, serializing each relationship information in the relationship information set to generate serialized information, so as to obtain a serialized information set.
In some embodiments, the execution subject serializes each piece of relationship information in the relationship information set to obtain a serialized information set. Here, serialization converts each piece of relationship information in the relationship information set into a form that can be stored or transmitted.
As an example, the relationship information of "Shandong", "Yantai", "of", "Red Fuji" and "apple" is serialized to obtain the following serialized information set:
[ [ "Shandong-Yantai", [01000000000], [00010000000], 0.91],
[ "Shandong-of", [01000000000], [00000001000], 0.05],
[ "Shandong-Red Fuji", [01000000000], [00000010000], 0.65],
[ "Shandong-apple", [01000000000], [00000000010], 0.62],
[ "Yantai-of", [00010000000], [00000001000], 0.04],
[ "Yantai-Red Fuji", [00010000000], [00000010000], 0.72],
[ "Yantai-apple", [00010000000], [00000000010], 0.70],
[ "of-Red Fuji", [00000001000], [00000010000], 0.04],
[ "of-apple", [00000001000], [00000000010], 0.03],
[ "Red Fuji-apple", [00000010000], [00000000010], 0.96] ].
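A short Python sketch of the serialization step follows. It assumes JSON as the storable and transmittable form, which the text does not specify, and takes the association degrees as given inputs copied from the example above. The function name `serialize_relations` and the reduced three-word input are illustrative only.

```python
import json
from itertools import combinations


def serialize_relations(word_vectors, degrees):
    """Return a JSON string with one [pair, vector, vector, degree] record
    for every unordered pair of words, in insertion order."""
    records = []
    for (first, first_vec), (second, second_vec) in combinations(word_vectors.items(), 2):
        records.append([f"{first}-{second}", first_vec, second_vec, degrees[(first, second)]])
    return json.dumps(records, ensure_ascii=False)


word_vectors = {
    "Shandong": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Yantai": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "Red Fuji": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
}
# Association degrees taken from the worked example, not computed here.
degrees = {
    ("Shandong", "Yantai"): 0.91,
    ("Shandong", "Red Fuji"): 0.65,
    ("Yantai", "Red Fuji"): 0.72,
}
payload = serialize_relations(word_vectors, degrees)
```

Calling `json.loads(payload)` restores the nested list, so the same records can be stored and read back without loss.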
Step 204, labeling each piece of serialized information in the serialized information set to generate labeled serialized information, obtaining a labeled serialized information set.
In some embodiments, the execution subject labels each piece of serialized information in the serialized information set to obtain a labeled serialized information set. The labeled serialized information may include at least one labeled word vector and its corresponding labeling information. Depending on the actual situation, the labeled content may also include the part of speech of the labeled word vector, the category of the word vector, and the like.
As an example, labeling the relationship information set of "Red Fuji apple of Shandong Yantai" results in the following labeled serialized information set:
[ [ "Shandong", [01000000000], [ "n", "province" ] ],
[ "Yantai", [00010000000], [ "n", "city" ] ],
[ "of", [00000001000], [ "aux" ] ],
[ "Red Fuji", [00000010000], [ "n", "fruit variety" ] ],
[ "apple", [00000000010], [ "n", "fruit name" ] ] ].
Step 205, generating the structured information related to the item name according to the labeled serialized information set.
In some embodiments, the execution subject extracts and combines the labeled serialized information in the labeled serialized information set to generate the structured information related to the item name. Depending on the actual situation, the labeling information in the labeled content of each word vector is selected and combined to generate the structured information corresponding to the item name.
As an example, the labeling information in the labeled serialized information set of "Red Fuji apple of Shandong Yantai" is extracted to obtain the following structured information:
{ 'province': 'Shandong', 'city': 'Yantai', 'fruit variety': 'Red Fuji', 'fruit name': 'apple' },
{ 'Shandong': 'n', 'Yantai': 'n', 'of': 'aux', 'Red Fuji': 'n', 'apple': 'n' }.
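The extraction and combination in step 205 can be sketched as follows. This is a minimal illustration: the function name `build_structured_info` is hypothetical, and the record layout [word, vector, labels] with the part of speech first and an optional category second is assumed from the labeled example of step 204.

```python
def build_structured_info(labeled_records):
    """Combine labeled records into two mappings:
    category -> word, and word -> part of speech."""
    by_category = {}
    by_word = {}
    for word, _vector, labels in labeled_records:
        by_word[word] = labels[0]
        # Function words such as "of" carry a part of speech but no category.
        if len(labels) > 1:
            by_category[labels[1]] = word
    return by_category, by_word


labeled = [
    ["Shandong", [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ["n", "province"]],
    ["Yantai", [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], ["n", "city"]],
    ["of", [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], ["aux"]],
    ["Red Fuji", [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], ["n", "fruit variety"]],
    ["apple", [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], ["n", "fruit name"]],
]
by_category, by_word = build_structured_info(labeled)
```

The two dictionaries correspond to the two groups of structured information in the example: one keyed by category, one keyed by word.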
One of the above-described embodiments of the present disclosure has the following advantageous effects. A word vector set is generated by vectorizing the pre-labeled item name. The relationship information among the word vectors in the word vector set is then determined to obtain a relationship information set. The relationship information in this set improves labeling accuracy, which addresses the low accuracy of labeling polysemous words contained in item names. Each piece of relationship information in the relationship information set is serialized to generate serialized information, obtaining a serialized information set. Each piece of serialized information in the serialized information set is then labeled to generate labeled serialized information, obtaining a labeled serialized information set. Finally, the structured information related to the item name is generated and displayed according to the labeled serialized information set. The structured information corresponding to an item name can therefore be actively generated and displayed from the generated labeling information and the words obtained by segmenting the item name. This solves the problem that, for item names newly added to the item library, the mined structured information dictionaries contain no corresponding entries, so that the structured information cannot be labeled well and its generation is not accurate enough.
With further reference to FIG. 3, a flow 300 of further embodiments of a structured information display method according to the present disclosure is shown. The above-described method may be performed by the computing device 101 of fig. 1. The structured information display method comprises the following steps:
Step 301, performing word segmentation processing on the pre-labeled item name to generate at least one word.
In some embodiments, an execution subject (such as the computing device 101 shown in fig. 1) of the structured information display method may perform word segmentation processing on the pre-labeled item name to generate at least one word.
As an example, the pre-labeled item name may be "Hunan rice flour of Changsha, Hunan-food", where "food" is the pre-labeled information for the item name "Hunan rice flour of Changsha, Hunan". Word segmentation yields "[Hunan, Changsha, of, Hunan, rice flour]-food".
Step 302, determining the position of each word in the pre-labeled item name as a number, and obtaining a number set.
In some embodiments, the execution subject obtains the number set based on a word obtained by performing word segmentation processing on the pre-labeled item name and a position of the word obtained by word segmentation in the pre-labeled item name.
As an example, the words obtained by the word segmentation may be "[Hunan, Changsha, of, Hunan, rice flour]-food", for which the corresponding number set is ['1-Hunan', '2-Changsha', '3-of', '4-Hunan', '5-rice flour'].
Step 303, determining the number of times each word appears in the pre-labeled item name, obtaining a count set.
In some embodiments, the execution subject obtains the count set by determining the number of times each word occurs in the pre-labeled item name.
As an example, the words obtained by the word segmentation may be "[Hunan, Changsha, of, Hunan, rice flour]-food", with the corresponding number set ['1-Hunan', '2-Changsha', '3-of', '4-Hunan', '5-rice flour'], resulting in the count set "[2, 1, 1, 2, 1]".
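Steps 302 and 303 can be illustrated with a small Python sketch. The function name `number_and_count` is hypothetical, the input is assumed to be the already segmented word list, and the particle is written as "of" for illustration.

```python
from collections import Counter


def number_and_count(words):
    """Return the number set ("position-word" strings, 1-based) and the
    count set (occurrences of each word, listed per position)."""
    numbers = [f"{position}-{word}" for position, word in enumerate(words, start=1)]
    occurrences = Counter(words)
    counts = [occurrences[word] for word in words]
    return numbers, counts


words = ["Hunan", "Changsha", "of", "Hunan", "rice flour"]
numbers, counts = number_and_count(words)
```

Because "Hunan" occurs twice, both of its positions receive the count 2, which gives the count set [2, 1, 1, 2, 1] of the example.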
Step 304, determining each word as a value, resulting in a value set.
In some embodiments, the execution body determines each word as a value, resulting in a set of values.
By way of example, the words obtained by the word segmentation may be "[Hunan, Changsha, of, Hunan, rice flour]-food", resulting in the value set [Hunan, Changsha, of, Hunan, rice flour].
Step 305, generating a word dictionary based on the number set and the value set.
In some embodiments, the execution body generates the word dictionary based on the number set and the value set.
As an example, the number set may be "['1-Hunan', '2-Changsha', '3-of', '4-Hunan', '5-rice flour']", and the value set may be "[Hunan, Changsha, of, Hunan, rice flour]". The resulting word dictionary is "{"1": "Hunan", "2": "Changsha", "3": "of", "4": "Hunan", "5": "rice flour"}".
Step 306, determining the word vectors corresponding to the words based on the word dictionary and the count set.
In some embodiments, the execution subject may determine the word vector corresponding to each word from the word dictionary and the count set.
As an example, the word dictionary may be {"1": "Hunan", "2": "Changsha", "3": "of", "4": "Hunan", "5": "rice flour"}. The count set may be [2, 1, 1, 2, 1]. The corresponding word vector set is thus { "Hunan": [20000], "Changsha": [01000], "of": [00100], "Hunan": [00020], "rice flour": [00001] }.
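Steps 305 and 306 can be sketched in Python as below. The function names `build_word_dictionary` and `build_word_vectors` are hypothetical, and the particle is written as "of" for illustration. The vectors are returned as a list of (word, vector) pairs rather than a dictionary, an implementation choice made here because "Hunan" occurs twice and would otherwise collide as a key.

```python
def build_word_dictionary(numbers, values):
    """Map the position part of each "position-word" entry to its word."""
    return {entry.split("-", 1)[0]: value for entry, value in zip(numbers, values)}


def build_word_vectors(word_dictionary, counts):
    """Build one vector per position: the occurrence count of the word is
    placed at the word's own position, all other dimensions are 0."""
    size = len(word_dictionary)
    vectors = []
    for (key, word), count in zip(word_dictionary.items(), counts):
        vector = [0] * size
        vector[int(key) - 1] = count
        vectors.append((word, vector))
    return vectors


numbers = ["1-Hunan", "2-Changsha", "3-of", "4-Hunan", "5-rice flour"]
values = ["Hunan", "Changsha", "of", "Hunan", "rice flour"]
counts = [2, 1, 1, 2, 1]

word_dictionary = build_word_dictionary(numbers, values)
word_vectors = build_word_vectors(word_dictionary, counts)
```

The first "Hunan" maps to [2, 0, 0, 0, 0] and the second to [0, 0, 0, 2, 0], matching the word vector set of the example.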
Step 307, generating a cluster center set.
In some embodiments, the execution subject selects a predetermined number of word vectors from the word vector set as cluster centers and thereby determines the cluster center set, including the following steps:
First, selecting a word vector from the word vector set to generate a cluster center set.
As an example, the word vector "[01000]" corresponding to "Changsha" is selected, and the cluster center set "[[01000]]" is generated.
Second, the distance between each remaining word vector in the word vector set and each cluster center in the cluster center set is calculated by a distance formula (shown only as an image in the original publication) to obtain a distance set. The symbols of the formula, written here as $d$, $n$, $i$, $x_i$ and $y_i$, have the following meanings:
$d$ represents the distance between the word vector and the cluster center;
$n$ represents the length of the word vector;
$i$ represents the dimension index of the word vector, with range $[1, n]$;
$x_i$ represents the value of the $i$-th dimension of the word vector;
$y_i$ represents the value of the $i$-th dimension of the second word vector (the cluster center).
As an example, the word vector "[00001]" corresponding to "rice flour" is selected, and its distance to each cluster center in the cluster center set is calculated by the distance formula, giving a distance set that contains a single distance value (shown only as an image in the original publication).
Third, the minimum distance in the above distance set is added to the candidate distance set.
As an example, the minimum distance in the distance set obtained for the word vector "[00001]" corresponding to "rice flour" is added to the candidate distance set, so that the candidate distance set contains that distance.
Fourth, the second and third steps are repeated until the number of distances in the candidate distance set equals the number of remaining word vectors in the word vector set. The word vector corresponding to the maximum distance in the candidate distance set is then selected and added to the cluster center set.
As an example, for the word vector "[20000]" corresponding to "Hunan", the word vector "[00100]" corresponding to "of", the word vector "[00020]" corresponding to "Hunan" and the word vector "[00001]" corresponding to "rice flour", the minimum of the distances to each cluster center in the cluster center set, calculated by the distance formula, is added to the candidate distance set. The resulting candidate distance set holds four values (shown only as images in the original publication), of which the first and third are the same and the second and fourth are the same. The word vectors corresponding to the maximum distance in the candidate distance set, "[20000]" and "[00020]", are added to the cluster center set.
Fifth, the fourth step is repeated until the number of cluster centers in the cluster center set reaches the preset number.
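The five steps above amount to a farthest-first selection of cluster centers, which can be sketched in Python as follows. Two points are assumptions rather than statements of the patent: Euclidean distance stands in for the distance formula, which is not reproduced in this text, and on a tie for the maximum distance only the first candidate is added per round. The names `euclidean` and `select_centers` are hypothetical.

```python
import math


def euclidean(first, second):
    """Assumed stand-in for the distance formula."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(first, second)))


def select_centers(vectors, preset_number, first_index=0):
    """Start from one chosen vector, then repeatedly add the remaining
    vector whose minimum distance to the current centers is largest."""
    centers = [vectors[first_index]]
    remaining = [v for i, v in enumerate(vectors) if i != first_index]
    while len(centers) < preset_number and remaining:
        # Candidate distance set: one minimum distance per remaining vector.
        candidates = [min(euclidean(v, c) for c in centers) for v in remaining]
        # Index of the first maximum; ties resolve to the earliest vector.
        best = max(range(len(remaining)), key=candidates.__getitem__)
        centers.append(remaining.pop(best))
    return centers


vectors = [
    [2, 0, 0, 0, 0],  # Hunan (first occurrence)
    [0, 1, 0, 0, 0],  # Changsha
    [0, 0, 1, 0, 0],  # of
    [0, 0, 0, 2, 0],  # Hunan (second occurrence)
    [0, 0, 0, 0, 1],  # rice flour
]
centers = select_centers(vectors, preset_number=2, first_index=1)
```

Starting from the "Changsha" vector, the two "Hunan" vectors tie for the largest candidate distance under this assumed metric, and the first of them is chosen, so `centers` holds the "Changsha" and first "Hunan" vectors.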
Step 308, determining the relationship information among the word vectors in the word vector set to obtain a relationship information set.
In some embodiments, the execution subject obtains the relationship information set based on a word vector set and a cluster center set. The relationship information refers to the association degree information between two word vectors, and the association degree information between two word vectors can be obtained through the following steps:
First, based on the cluster center set, calculate the distance between a word vector in the word vector set and each cluster center in the cluster center set by the following second distance formula:
[Formula image not reproduced in the source text]
where the symbols of the formula, which appear only as images in the source, denote in order: the distance between the word vector and the cluster center; the length of the word vector; the dimension index of the word vector, whose value range runs from 1 to an upper bound given as an image; the value of the word vector in that dimension; the value of the cluster center in that dimension; and the word vector length.
As an example, the cluster center set is "[[20000], [01000]]", where the word vector "[20000]" corresponds to the word "Hunan" and the word vector "[01000]" corresponds to the word "Changsha", and the word vector set is "{"rice flour": [00001]}". The distances between the word vector in the word vector set and the cluster centers in the cluster center set, calculated by the second distance formula, are 1 and 0.01, respectively.
Second, add each word vector to the nearest cluster according to the calculation result, recalculate the cluster center, and replace the corresponding cluster center in the cluster center set with the new cluster center.
Third, repeat the first and second steps until the cluster centers in the cluster center set converge, obtain the relationship information among the word vectors in the word vector set, and obtain the relationship information set from that relationship information.
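The three steps above resemble a k-means style loop, which can be sketched as follows. This is a sketch under assumptions rather than the patent's implementation: Euclidean distance stands in for the second distance formula (which is an image in the source), the new cluster center is taken to be the component-wise mean of the cluster members (the text only says the center is "recalculated"), and the names `cluster_word_vectors`, `max_iter`, and `tol` are hypothetical. Treating word vectors that end up in the same cluster as related is likewise an interpretation.

```python
import math


def euclidean(a, b):
    # Stand-in for the second distance formula (assumption).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_word_vectors(word_vectors, centers, distance=euclidean,
                         max_iter=100, tol=1e-9):
    """Assign vectors to the nearest center and update centers until they converge."""
    centers = [[float(x) for x in c] for c in centers]
    assignment = []
    for _ in range(max_iter):
        # First step: index of the nearest cluster center for each word vector.
        assignment = [
            min(range(len(centers)), key=lambda j: distance(v, centers[j]))
            for v in word_vectors
        ]
        # Second step: recompute each center as the mean of its members
        # (assumption); an empty cluster keeps its previous center.
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, a in zip(word_vectors, assignment) if a == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)
        # Third step: stop once no center moves by more than tol.
        shift = max(distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, assignment


final_centers, labels = cluster_word_vectors(
    [[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]]
)
```

In this toy run the two centers settle at the means of the two well-separated pairs, and `labels` records which cluster each word vector joined.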
Step 309, serializing each relationship information in the relationship information set to generate serialized information, so as to obtain a serialized information set.
Step 310, labeling each piece of serialized information in the serialized information set to generate labeled serialized information, and obtaining a labeled serialized information set.
Step 311, generating the structured information related to the item name according to the labeled serialized information set.
In some embodiments, for the specific implementation and technical effects of steps 309-311, reference may be made to steps 203-205 in the embodiments corresponding to fig. 2, which are not described herein again.
One of the above-described embodiments of the present disclosure has the following advantageous effects:
A cluster center set is generated based on distance, and the distance between each word vector and the cluster centers in the cluster center set is then calculated; these steps are repeated until the cluster centers in the cluster center set converge, thereby determining the relationship information between the word vectors in the word vector set. In this way, the relevance among the word vectors is strengthened and the resulting relationship information is more accurate, which improves the accuracy of word labeling and, in turn, the accuracy with which the structured information is generated and displayed.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a structured information display apparatus. These apparatus embodiments correspond to the method embodiments described above with reference to fig. 2, and the apparatus may be applied in various electronic devices.
As shown in fig. 4, the structured information display apparatus 400 of some embodiments includes: a first determining unit 401, a second determining unit 402, a third determining unit 403, a fourth determining unit 404, and a generating and displaying unit 405. The first determining unit 401 is configured to vectorize the pre-labeled item name to generate a word vector set; the second determining unit 402 is configured to determine relationship information between word vectors in the word vector set, resulting in a relationship information set; the third determining unit 403 is configured to serialize each piece of relationship information in the relationship information set to generate serialized information, resulting in a serialized information set; the fourth determining unit 404 is configured to label each piece of serialized information in the serialized information set to generate labeled serialized information, resulting in a labeled serialized information set; and the generating and displaying unit 405 is configured to generate and display the structured information related to the item name according to the labeled serialized information set.
It will be understood that the elements described in the apparatus 400 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 400 and the units included therein, and will not be described herein again.
Referring now to FIG. 5, a block diagram of an electronic device (e.g., computing device 101 of FIG. 1) 500 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 5 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 5 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device described above, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: vectorize the pre-marked article name to generate a word vector set; determine relationship information between the word vectors in the word vector set to obtain a relationship information set; serialize each piece of relationship information in the relationship information set to generate serialized information, obtaining a serialized information set; label each piece of serialized information in the serialized information set to generate labeled serialized information, obtaining a labeled serialized information set; and generate and display the structured information related to the article name according to the labeled serialized information set.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor including a first determining unit, a second determining unit, a third determining unit, a fourth determining unit, and a generating and displaying unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the generating and displaying unit may also be described as a "unit that generates and displays structured information related to the item name according to the labeled serialized information set".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description covers only preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept defined above, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. A structured information display method, comprising:
vectorizing the pre-marked article name to generate a word vector set;
determining relation information among the word vectors in the word vector set to obtain a relation information set;
serializing each relationship information in the relationship information set to generate serialized information to obtain a serialized information set;
labeling each piece of serialized information in the serialized information set to generate labeled serialized information, and obtaining a labeled serialized information set;
generating and displaying structured information related to the article name according to the labeled serialized information set, wherein the determining of the relationship information between the word vectors in the word vector set to obtain a relationship information set comprises:
selecting word vectors from the word vector set to generate a cluster center set;
forming a residual word vector set by the unselected word vectors in the word vector set;
selecting residual word vectors from the residual word vector set, and executing the following cluster center set generation step:
determining the distance between the selected residual word vector and each cluster center in the cluster center set to obtain a distance set;
determining a minimum distance in the distance set;
adding the minimum distance as a candidate distance to a candidate distance set;
in response to the number of candidate distances in the candidate distance set being equal to the number of word vectors in the residual word vector set, selecting the residual word vector corresponding to the maximum distance in the candidate distance set and adding it to the cluster center set;
outputting the cluster center set in response to the number of cluster centers in the cluster center set reaching a predetermined number;
and in response to the number of cluster centers in the cluster center set not reaching the predetermined number, recombining the unselected residual word vectors in the residual word vector set into a new residual word vector set, selecting residual word vectors from the recombined residual word vector set, and executing the cluster center set generation step again.
2. The method of claim 1, wherein the method further comprises:
controlling a communicatively connected storage device to store the structured information.
3. The method of claim 1, wherein the vectorizing the pre-labeled item name to generate a set of word vectors comprises:
performing word segmentation processing on the pre-labeled article name to generate at least one word;
and converting each generated word into a corresponding word vector to obtain a word vector set.
4. The method of claim 3, wherein the method further comprises:
determining the position of each word in the pre-labeled article name as a number to obtain a number set;
determining the occurrence frequency of each word in the pre-labeled article name to obtain a frequency set;
determining each word as a value to obtain a value set;
generating a word dictionary based on the number set and the value set;
and determining a word vector corresponding to each word according to the word dictionary and the times set.
5. The method of claim 1, wherein the determining relationship information between word vectors in the set of word vectors, resulting in a set of relationship information, further comprises:
selecting a word vector from said set of word vectors, performing a first processing step as follows:
determining the word vector cluster center distance between the selected word vector and each cluster center in the cluster center set to obtain a word vector cluster center distance set;
selecting a minimum word vector cluster center distance from the word vector cluster center distance set;
in response to the minimum word vector cluster center distance being smaller than a predetermined threshold, adding the selected word vector to the cluster in the cluster set that is closest to the word vector;
removing the word vector from a set of word vectors in response to the minimum word vector cluster center distance being greater than or equal to a predetermined threshold;
generating a new cluster center set according to the existing word vectors in the cluster set, and determining the new cluster center set as a cluster center set;
determining whether an unconverged cluster center exists in the cluster center set;
in response to determining that no unconverged cluster center exists, determining the relationship information between the word vectors;
in response to determining that an unconverged cluster center exists, selecting an unselected word vector from the set of word vectors, and performing the first processing step again.
6. The method of claim 5, wherein the tagged serialization information includes at least one tagged word vector and a set of tagged information corresponding to the at least one tagged word vector, the tagged information including part-of-speech and category information of the corresponding tagged word vector.
7. A structured information display apparatus comprising:
a first determining unit configured to vectorize the pre-marked article name to generate a word vector set;
a second determining unit configured to determine relationship information between word vectors in the set of word vectors, resulting in a set of relationship information;
a third determining unit configured to serialize each relationship information in the relationship information set to generate serialized information, resulting in a serialized information set;
a fourth determining unit configured to label each piece of serialized information in the serialized information set to generate labeled serialized information, resulting in a labeled serialized information set;
a generating and displaying unit configured to generate and display the structured information related to the item name according to the labeled serialized information set, wherein the second determining unit is further configured to:
selecting word vectors from the word vector set to generate a cluster center set;
forming a residual word vector set by the unselected word vectors in the word vector set;
selecting residual word vectors from the residual word vector set, and executing the following cluster center set generation step:
determining the distance between the selected residual word vector and each cluster center in the cluster center set to obtain a distance set;
determining a minimum distance in the distance set;
adding the minimum distance as a candidate distance to a candidate distance set;
in response to the number of candidate distances in the candidate distance set being equal to the number of word vectors in the residual word vector set, selecting the residual word vector corresponding to the maximum distance in the candidate distance set and adding it to the cluster center set;
outputting the cluster center set in response to the number of cluster centers in the cluster center set reaching a predetermined number;
and in response to the number of cluster centers in the cluster center set not reaching the predetermined number, recombining the unselected residual word vectors in the residual word vector set into a new residual word vector set, selecting residual word vectors from the recombined residual word vector set, and executing the cluster center set generation step again.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
CN202010671714.0A 2020-07-14 2020-07-14 Structured information display method and device, electronic equipment and computer readable medium Active CN111563117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010671714.0A CN111563117B (en) 2020-07-14 2020-07-14 Structured information display method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010671714.0A CN111563117B (en) 2020-07-14 2020-07-14 Structured information display method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN111563117A CN111563117A (en) 2020-08-21
CN111563117B true CN111563117B (en) 2020-11-20

Family

ID=72070171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010671714.0A Active CN111563117B (en) 2020-07-14 2020-07-14 Structured information display method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN111563117B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779316A (en) * 2021-02-19 2021-12-10 北京沃东天骏信息技术有限公司 Information generation method and device, electronic equipment and computer readable medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10162840B1 (en) * 2015-09-02 2018-12-25 Jpmorgan Chase Bank, N.A. Method and system for aggregating financial measures in a distributed cache
CN108920465A (en) * 2018-07-13 2018-11-30 福州大学 A kind of agriculture field Relation extraction method based on syntactic-semantic
CN110298032B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Text classification corpus labeling training system
CN110826303A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Joint information extraction method based on weak supervised learning
CN111046670B (en) * 2019-12-09 2023-04-18 大连理工大学 Entity and relationship combined extraction method based on drug case legal documents

Also Published As

Publication number Publication date
CN111563117A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN108733317B (en) Data storage method and device
CN109933217B (en) Method and device for pushing sentences
CN110688528A (en) Method, apparatus, electronic device, and medium for generating classification information of video
CN111046252B (en) Information processing method, device, medium, electronic equipment and system
CN115757400A (en) Data table processing method and device, electronic equipment and computer readable medium
CN111563117B (en) Structured information display method and device, electronic equipment and computer readable medium
CN110852057A (en) Method and device for calculating text similarity
US20240078387A1 (en) Text chain generation method and apparatus, device, and medium
CN110598049A (en) Method, apparatus, electronic device and computer readable medium for retrieving video
CN113946648B (en) Structured information generation method and device, electronic equipment and medium
CN113987118A (en) Corpus acquisition method, apparatus, device and storage medium
CN111737572B (en) Search statement generation method and device and electronic equipment
CN111681088B (en) Information pushing method and device, electronic equipment and computer readable medium
CN109800438B (en) Method and apparatus for generating information
CN113379503A (en) Recommendation information display method and device, electronic equipment and computer readable medium
CN114792086A (en) Information extraction method, device, equipment and medium supporting text cross coverage
CN112651231A (en) Spoken language information processing method and device and electronic equipment
CN111737571A (en) Searching method and device and electronic equipment
CN111737587A (en) Device operation method, device, electronic device and computer readable medium
CN111784377A (en) Method and apparatus for generating information
CN115328811B (en) Program statement testing method and device for industrial control network simulation and electronic equipment
CN115098647B (en) Feature vector generation method and device for text representation and electronic equipment
CN111930704B (en) Service alarm equipment control method, device, equipment and computer readable medium
CN117172220B (en) Text similarity information generation method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231110

Address after: 518000, 705, No. 121 Minsheng Avenue, Shangcun Community, Gongming Street, Guangming District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Hongli Intellectual Property Service Co.,Ltd.

Address before: 100102 room 801, 08 / F, building 7, yard 34, Chuangyuan Road, Chaoyang District, Beijing

Patentee before: BEIJING MISSFRESH E-COMMERCE Co.,Ltd.