CN114818686A

CN114818686A - Text recommendation method based on artificial intelligence and related equipment

Info

Publication number: CN114818686A
Application number: CN202210507428.XA
Authority: CN
Inventors: 陈凡
Original assignee: Ping An International Smart City Technology Co Ltd
Current assignee: Ping An International Smart City Technology Co Ltd
Priority date: 2022-05-10
Filing date: 2022-05-10
Publication date: 2022-07-29

Abstract

The application provides a text recommendation method and device based on artificial intelligence, electronic equipment and a storage medium, wherein the text recommendation method based on artificial intelligence comprises the following steps: segmenting a text to obtain a plurality of words; encoding the vocabulary to obtain encoded data; classifying the encoded data to obtain a plurality of semantic groups; calculating the similarity between the coded data corresponding to the vocabulary to be evaluated and each semantic group, and taking the semantic group corresponding to the maximum similarity as a target group; classifying the coded data in the target group to obtain a plurality of candidate groups; calculating the association index of each candidate group; and recommending vocabularies corresponding to the coded data in the candidate group in sequence according to the sequence from high to low of the correlation indexes. According to the method, a plurality of candidate groups are obtained through double classification, and the association index of each candidate group is calculated, so that words can be recommended based on the association index, and the accuracy of text recommendation is improved.

Description

Text recommendation method based on artificial intelligence and related equipment

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a text recommendation method and apparatus, an electronic device, and a storage medium based on artificial intelligence.

Background

With the rapid development of information technology and internet, the ways for people to acquire knowledge in daily life are more and more diversified. In recent years, there has been an increasing demand for retrieving relevant text contents through the internet and keywords.

At present, text recommendation is usually performed by a method of adding category labels to texts in advance and retrieving related texts according to the labels, however, in this way, a large amount of manpower is consumed to preprocess text data, the manpower cost is high, and the accuracy is low.

Disclosure of Invention

In view of the foregoing, there is a need to provide a text recommendation method based on artificial intelligence and related devices, so as to solve the technical problem of how to improve the accuracy of text recommendation, where the related devices include an artificial intelligence based text recommendation apparatus, an electronic device and a storage medium.

The embodiment of the application provides a text recommendation method based on artificial intelligence, which comprises the following steps:

segmenting words of texts in a natural language database by using a preset word segmentation tool to obtain a plurality of words;

coding the vocabularies to obtain coded data corresponding to each vocabulary;

classifying the encoded data to obtain a plurality of semantic groups, each semantic group containing a plurality of encoded data;

calculating the similarity between the coded data corresponding to the vocabulary to be evaluated and each semantic group, and taking the semantic group corresponding to the maximum similarity as a target group;

classifying the coded data in the target group to obtain a plurality of candidate groups, wherein each candidate group comprises a plurality of coded data;

calculating a correlation index of each candidate group, wherein the correlation index is used for representing the recommended degree of the coded data in each candidate group;

and recommending vocabularies corresponding to the coded data in the candidate group in sequence according to the sequence from high to low of the correlation indexes.

According to the text recommendation method based on artificial intelligence, the words are coded to obtain the quantitative expression of the text, the accuracy of text recommendation can be improved, the coded data are classified for multiple times to obtain the optimized classification result, the target group data are further screened out from the optimized classification result according to the words to be evaluated, the target group data are subjected to secondary classification to obtain a plurality of candidate group data, the association index of each candidate group is evaluated, the words are recommended to the user according to the association index, and therefore the accuracy of text recommendation can be improved.

In some embodiments, said classifying said encoded data to obtain a plurality of semantic groups comprises:

constructing a center list according to a preset upper classification limit, and selecting a plurality of coded data as center data according to each element in the center list;

classifying all the coded data for multiple times according to the central data to obtain multiple classification results, wherein each classification result comprises multiple vocabulary classification groups;

calculating the sum of squares of errors of each classification result to serve as an evaluation value, and calculating the difference value of the evaluation values to construct an evaluation result hash table;

and searching the classification result corresponding to the minimum difference value in the evaluation result hash table, and taking all vocabulary classification groups in the classification result as semantic groups.

Therefore, a plurality of classification results are obtained by classifying the encoded data, each classification result comprises a plurality of encoded data, the classification results are evaluated to screen out an optimized classification result, and the vocabulary classification groups in the optimized classification results are used as semantic groups, so that the accuracy of vocabulary classification is improved.

In some embodiments, the classifying the encoded data according to the central data to obtain a classification result, the classification result including a plurality of vocabulary classification groups, includes:

a, marking all the coded data as 0 and randomly selecting one coded data as current data;

b, respectively calculating the Euclidean distance between the current data and each central data;

c, marking the current data as 1 and classifying the current data into a category where the central data corresponding to the smaller Euclidean distance is located;

d, randomly selecting one coded data marked as 0 as the current data again, repeating the steps b to d until all the coded data are marked as 1, stopping iteration and obtaining a plurality of vocabulary classification groups;

e, calculating the mean value of all the coded data in each vocabulary classification group, taking the vocabulary classification group as a classification result if the mean value meets a preset judgment condition, and otherwise taking the coded data closest to the mean value in each vocabulary classification group as central data and repeating the steps a to e to obtain the classification result.

Therefore, the encoded data are clustered in an iterative mode to obtain a plurality of vocabulary classification groups, the vocabulary corresponding to the encoded data can be divided into a plurality of vocabulary classification groups with similar semantics without manually marking the vocabulary, and the efficiency of text recommendation is improved.

In some embodiments, the calculating a sum of squared errors for each classification result as an evaluation value, and calculating a difference value of the evaluation values to construct an evaluation result hash table includes:

respectively calculating the error of each vocabulary classification group, wherein the error is the square sum of Euclidean distances between all coded data in each vocabulary classification group and central data;

respectively calculating the sum of errors of all vocabulary classification groups in each classification result to serve as an evaluation value of each classification result;

and arranging the evaluation values according to the number of vocabulary classification groups in each classification result from small to large, calculating the difference value of two adjacent evaluation values, and combining the difference values according to the order of the evaluation values to construct an evaluation result hash table.

Therefore, by calculating the error of each vocabulary classification group in each classification result and further obtaining the evaluation value of each classification result, the evaluation result hash table is constructed according to the evaluation values, so that the subsequent screening of the more optimized classification results is facilitated, and the text recommendation efficiency can be improved.

In some embodiments, the classifying the encoded data in the target set to obtain a plurality of candidate sets comprises:

a, classifying the coded data in a target group according to a preset radius threshold and a preset density threshold, wherein the classes comprise core data and outlier data;

b, setting a judgment condition according to the radius threshold, wherein the judgment condition means that the relation between the two data meets any one of preset relations, and the preset relations comprise: the density direct refers to that the Euclidean distance between two coded data is not less than the radius threshold value; the density reach means that the relation between two coded data is not density reach and the two coded data have common density reach coded data; the density connection refers to the coded data of which the relation between the two coded data is not density reachable and has common density reachable;

c, marking all core data as 'unaccessed';

d, optionally selecting one core data marked as 'unaccessed' as target data;

e, sequentially traversing all core data labeled as 'unaccessed', if the traversed core data meet the judgment condition, classifying the core data and the target data into the same candidate group, and marking all the core data in the candidate group as 'accessed';

and f, repeating the steps d and e to obtain a plurality of candidate groups, and taking all outlier data as one candidate group.

Therefore, a plurality of candidate groups are obtained by further classifying the encoded data in the target group, so that the vocabulary corresponding to the encoded data can be more finely divided, and the accuracy of text recommendation is improved.

In some embodiments, the evaluating each candidate set to obtain the correlation index for each candidate set comprises:

calculating the mean value of each candidate group, and taking the coded data with the minimum Euclidean distance from the mean value in each candidate group as the centroid of each candidate group;

calculating the variance of the coded data in each candidate group to serve as a polymerization degree, wherein the polymerization degree is used for representing the similarity degree of vocabularies corresponding to the coded data in the candidate groups;

calculating the similarity of the target data and the centroid data, wherein the higher the similarity is, the more similar the target data and the centroid data are;

and inputting the polymerization degree and the similarity into a self-defined normalization model to obtain a normalization result, and taking the normalization result as a correlation index of each candidate group, wherein the correlation index is used for representing the correlation degree of the coded data in the candidate group and the target data.

Therefore, the correlation index corresponding to each candidate group is obtained by inputting the polymerization degree of the coded data in each candidate group and the similarity between each candidate group and the target data into the self-defined normalization model, and data support is provided for subsequent text recommendation, so that the accuracy of text recommendation can be improved.

In some embodiments, the customized normalized model satisfies the relationship:

wherein, T represents the correlation index, and the higher the value of the correlation index is, the more words corresponding to the coded data in the corresponding candidate group should be recommended; x represents the polymerization degree of the candidate group, and the higher the polymerization degree is, the higher the importance of the vocabulary corresponding to the coded data in the candidate group is; y represents the similarity between the target data and the candidate group, and the higher the similarity is, the more words corresponding to the encoded data in the candidate group should be recommended; alpha represents a preset harmonic constant; e represents a natural constant.

Therefore, the correlation index of each candidate group is obtained through the self-defined normalization model, data support is provided for subsequent text recommendation, and the accuracy of text recommendation can be improved.

The embodiment of the present application further provides a text recommendation device based on artificial intelligence, the device includes:

the word segmentation unit is used for segmenting words of texts in the natural language database by using a preset word segmentation tool to obtain a plurality of words;

the encoding unit is used for encoding the vocabularies in the natural language database to obtain encoded data corresponding to each vocabulary;

a first classification unit configured to classify the encoded data to obtain a plurality of semantic groups, each semantic group containing a plurality of encoded data;

the first calculating unit is used for calculating the similarity between the coded data corresponding to the vocabulary to be evaluated and each semantic group, and taking the semantic group corresponding to the maximum similarity as a target group;

a second classification unit, configured to classify the encoded data in the target group to obtain multiple candidate groups, where each candidate group includes multiple encoded data;

the second calculating unit is used for calculating a correlation index of each candidate group, and the correlation index is used for representing the recommended degree of the coded data in each candidate group;

and the recommending unit is used for sequentially recommending the vocabularies corresponding to the coded data in the candidate group according to the sequence from high to low of the correlation indexes.

An embodiment of the present application further provides an electronic device, where the device includes:

a memory storing computer readable instructions; and

a processor executing computer readable instructions stored in the memory to implement the artificial intelligence based text recommendation method.

Embodiments of the present application further provide a computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the artificial intelligence based text recommendation method.

Drawings

FIG. 1 is a flow chart of a preferred embodiment of an artificial intelligence based text recommendation method to which the present application relates.

FIG. 2 is a functional block diagram of a preferred embodiment of an artificial intelligence based text recommender to which the present application is directed.

Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the artificial intelligence based text recommendation method according to the present application.

Detailed Description

For a clearer understanding of the objects, features and advantages of the present application, reference will now be made in detail to the present application with reference to the accompanying drawings and specific examples. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict. In the following description, numerous specific details are set forth to provide a thorough understanding of the present application, and the described embodiments are merely a subset of the embodiments of the present application and are not intended to be a complete embodiment.

Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, features defined as "first", "second", may explicitly or implicitly include one or more of the described features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

The embodiment of the present application provides a text recommendation method based on artificial intelligence, which can be applied to one or more electronic devices, where the electronic device is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and hardware of the electronic device includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a programmable gate array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The electronic device may be any electronic product capable of performing human-computer interaction with a user, for example, a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game machine, an interactive web television (IPTV), an intelligent wearable device, and the like.

The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a cloud computing (cloud computing) based cloud consisting of a large number of hosts or network servers.

The network where the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.

FIG. 1 is a flow chart of a preferred embodiment of the text recommendation method based on artificial intelligence according to the present application. The order of the steps in the flow chart may be changed and some steps may be omitted according to different needs.

S10, performing word segmentation on the text in the natural language database by using a preset word segmentation tool to obtain a plurality of words;

in this alternative embodiment, the natural language database refers to a data set designed, stored and managed according to a data structure, and for example, the natural language database may be a MySQL database, which is an open-source database and functions to store any natural language text.

In this alternative embodiment, the preset word segmentation tool may be a jieba word segmentation tool, and the jieba word segmentation tool is a code written by a Python program, and functions to separate words in a natural language text to obtain a plurality of words. Cut (presence), wherein cut represents a word segmentation instruction in the jieba word segmentation tool, and presence represents the natural language text.

Illustratively, when the sensor is "everything is just asleep", the output of the jieba segmentation tool is a list, each element in the list is a vocabulary, the list is in the form of [ everything, all, just, asleep, like ].

Therefore, the natural language text is preprocessed through the preset word segmentation tool to obtain a plurality of words, the natural language text can be represented in a fine-grained mode, a foundation is provided for subsequent word coding, and therefore the accuracy of text recommendation can be improved.

And S11, coding the words to obtain coded data corresponding to each word.

In this alternative embodiment, each vocabulary in the natural language database may be encoded according to a preset text encoding method, where the preset text encoding method may be a one-hot algorithm, which means a one-hot encoding algorithm, and the one-hot encoding algorithm functions to encode each vocabulary in the natural language database into binary vector data, which may be in the form of [0,1,0, 0] for example.

In this alternative embodiment, the weight corresponding to each vocabulary in the natural language database may be calculated according to a preset vocabulary weight algorithm, where the weight is used to represent the importance of each vocabulary in the natural language database, and the higher the value of the weight is, the more important the corresponding vocabulary is in the natural language database. The preset vocabulary weight algorithm can be a TF-IDF algorithm, the TF-IDF algorithm is named as term frequency-inverse document frequency algorithm in a full name, the TF-IDF algorithm can output the weight corresponding to each vocabulary, and the value of the weight can be any real number.

In this alternative embodiment, the product of the weight and the binary vector data of the corresponding word may be calculated to obtain the encoded data corresponding to each word, for example, if the word "i" corresponds to binary data of [0,1,0,0,0] and the weight thereof is 0.23, the word "i" corresponds to encoded data of [0,0.23,0,0,0 ].

In this alternative embodiment, the vocabulary may be used as a key, the encoded data corresponding to the vocabulary may be used as values to construct key-value pairs, and all the key-value pairs may be used as an encoded hash table.

Illustratively, when the vocabulary is "I" and its corresponding encoded data is [0,0.23,0,0,0], then the key-value pair is in the form of (I, [0,0.23,0,0,0 ]).

In this way, the encoded data corresponding to each vocabulary is obtained by calculating the product of the importance corresponding to each vocabulary and the binary vector data corresponding to each vocabulary, and the encoded data can represent text data in a quantitative manner, so that the accuracy of subsequent text recommendation is improved.

S12, classifying the coded data to obtain a plurality of semantic groups, wherein each semantic group comprises a plurality of coded data.

In this optional embodiment, the classifying the encoded data to obtain a plurality of semantic groups, each semantic group including a plurality of encoded data includes:

In this alternative embodiment, the center list may be constructed according to a preset upper classification limit, where the preset upper classification limit is a positive integer greater than 2, and the upper classification limit is used to control the number of times of classifying the encoded data.

In this optional embodiment, each element in the center list is a positive integer, where each element represents the number of categories obtained by classifying the encoded data each time, and since the encoded data needs to be classified into multiple categories, the minimum value of the element in the center list is 2.

In this alternative embodiment, each positive integer greater than or equal to 2 may be sequentially used as an element in the center list, and the preset upper limit of classification may be used as the last element in the center list, and if the preset upper limit of classification is n, the center list has n-1 elements.

Illustratively, when the preset upper classification limit is 9, the central list is [2,3,4,5,6,7,8,9], where the value of each element is used to characterize the number of classes obtained after classifying the encoded data each time.

In this alternative embodiment, the elements in the center list may be sequentially selected as the number of classifications, and the encoded data with the number equal to the number of classifications may be randomly selected from the encoded data as the center data, for example, if the center list is [2,3,4,5,6,7,8,9], then the first classification number selected is2, and then two encoded data may be randomly selected from all the encoded data as the center data.

In an optional embodiment, classifying all the encoded data a plurality of times according to the central data to obtain a plurality of classification results comprises:

e, calculating the mean value of all the coded data in each vocabulary classification group, taking the vocabulary classification group as a classification result if all the mean values meet the preset judgment condition, and otherwise taking the coded data which is closest to the mean value in each vocabulary classification group as central data and repeating the steps a to e to obtain the classification result.

In this optional embodiment, the specific implementation steps of classifying all the encoded data for multiple times to obtain multiple classification results include:

a1: marking all the coded data as 0, and randomly selecting one coded data as current data, wherein the current data corresponds to a certain vocabulary in the natural language database;

a2: calculating the euclidean distance between the current data and each central data, for example, if the current data is [0.23,0,0,0] and one central data is [0,0.65,0,0], the euclidean distance between the current data and the central data is calculated by:

the euclidean distance between the current data and the center data is 0.4754.

The Euclidean distance is used for representing the similarity degree between the current data and the central data, and the smaller the Euclidean distance is, the higher the similarity degree between the vocabulary represented by the current data and the vocabulary represented by the central data is;

a3: marking the current data as 1 and classifying the current data into the vocabulary section where the central data corresponding to the smaller Euclidean distance is locatedClass group, for example, if there are four central data, and the euclidean distance between the current data and each central data is respectively recorded as Dis ¹ ＝0.4754、Dis ² ＝0.57、Dis ³ 0.65 and Dis ⁴ 0.32, where the superscript for each euclidean distance represents the index for each center datum. Since the minimum value of the four Euclidean distances is Dis ⁴ 0.32, so the current data is classified as the vocabulary classification group where the 4 th center data is located;

a4: randomly selecting one coded data marked as 0 as current data, and repeating the steps A2 to A4 until all the coded data are marked as 1 and a plurality of vocabulary classification groups are obtained;

a5: calculating the mean value of all the coded data in each vocabulary classification group, taking the plurality of vocabulary classification groups as classification results if all the mean values meet preset discrimination conditions, otherwise taking the coded data closest to the mean value in each vocabulary classification group as central data, and repeating the steps A1-A5 to obtain the classification results.

The preset judgment condition means that the euclidean distance between the mean value and the central data is smaller than a preset threshold, for example, the preset threshold may be 0.1, and the purpose of this step is to ensure that the change of the classification result is small enough, that is, the performance of the classification result is excellent.

In this optional embodiment, each element in each central list corresponds to one classification result, each classification result corresponds to a plurality of vocabulary classification groups, each vocabulary classification group includes a plurality of encoded data, each encoded data corresponds to a vocabulary, and the encoded data in each vocabulary classification group corresponds to a vocabulary of the same semantic meaning.

In an alternative embodiment, calculating a sum of squared errors for each classification result as an evaluation value, and calculating a difference value of the evaluation values to construct an evaluation result hash table includes:

calculating the sum of all errors in each classification result to serve as an evaluation value of each classification result;

In this alternative embodiment, the error of each vocabulary classification group in each classification result may be calculated separately, where the error is the sum of squares of euclidean distances between all encoded data in each vocabulary classification group and the central data, and the euclidean distances are calculated as follows:

where A represents the central data, B represents the encoded data, and i represents an index of the dimensions of the central data and encoded data, such that A represents the central data, B represents the encoded data, and i represents the index of the dimensions of the central data and encoded data _i A value representing the ith dimension of the central data, B _i The value of the ith dimension representing the coded data, m represents the total number of the dimensions of the center data and the coded data, and Dis represents the Euclidean distance, wherein the smaller the Euclidean distance is, the higher the similarity between the center data and the coded data is, and the more similar the vocabulary represented by the center data and the vocabulary represented by the coded data is.

For example, if the central data of a certain vocabulary classification group is [0,0.23,0,0] and one of the encoded data is [0.45,0,0,0], the euclidean distance between the central data and the encoded data is calculated by:

the euclidean distance between the central data and the encoded data is 0.5.

For example, if a certain vocabulary classification group includes four encoded data and a central data, and the euclidean distances between the four encoded data and the central data are 0.5, 1, 1.5, and 2, respectively, the error corresponding to the vocabulary classification group is calculated as follows:

0.5 ² +1 ² +1.5 ² +2 ² ＝7.5

the error corresponding to the vocabulary classification group is 7.5, and the smaller the error is, the higher the aggregation degree of all the encoded data in the vocabulary classification group is, the more similar the vocabularies corresponding to all the encoded data in the vocabulary classification group are.

In this alternative embodiment, the sum of all errors in each classification result may be calculated as an evaluation value for each classification result, and a smaller evaluation value indicates a better effect of the corresponding classification result. For example, if a classification result includes four vocabulary classification groups, and the errors of the four vocabulary classification groups are 7.5, 9.5, 10.5, and 11.5, respectively, the evaluation value corresponding to the classification result is 39.

In this alternative embodiment, the number of vocabulary classification groups in each classification result may be used as a key, and the evaluation value corresponding to each classification result may be used as a value to construct a key-value pair, and the key-value pair may be arranged according to the value of the key from small to large.

Illustratively, if the evaluation value corresponding to the number of vocabulary classification groups of 2 is 100, the evaluation value corresponding to the number of vocabulary classification groups of 3 is 60, and the evaluation value corresponding to the number of vocabulary classification groups of 4 is 50, the key value pairs are arranged in the order of (2,100), (3,60), (4, 50).

In this alternative embodiment, the difference between the two adjacent evaluation values may be calculated, and for example, if the evaluation value obtained when the number of vocabulary classification groups is2 is 100 and the evaluation value obtained when the number of vocabulary classification groups is3 is 60, the difference is 40.

In this alternative embodiment, the difference values may be combined in order of evaluation values to construct an evaluation result hash table, which is, for example, [ ([2,3],40), ([3,4],10) ] if the difference values are 40 and 10, respectively.

In this alternative embodiment, the key corresponding to the smallest value in the evaluation result hash table may be selected as the target key, and for example, if the evaluation result hash table is [ ([2,3],40), ([3,4],10) ], the smallest value is 10 and the corresponding key is [3,4 ].

In this alternative embodiment, the vocabulary classification group in the classification result corresponding to the smaller value in the target key may be selected as the semantic group.

For example, if the target key is [3,4], three vocabulary classification groups in the classification result corresponding to 3 may be selected as the semantic group.

Therefore, a plurality of classification results are obtained by classifying the encoded data for a plurality of times, the error of each classification result is calculated to construct an evaluation result hash table, and a semantic group is obtained by inquiring the classification result corresponding to the minimum value in the evaluation result hash table, so that the efficiency and the accuracy of vocabulary classification can be considered.

And S13, calculating the similarity between the coded data corresponding to the vocabulary to be evaluated and each semantic group, and taking the semantic group corresponding to the maximum similarity as a target group.

In this alternative embodiment, the vocabulary to be evaluated refers to the vocabulary that needs to be associated in the natural language database, and the vocabulary to be evaluated may be used as a key, and the encoded hash table is traversed sequentially to find a value corresponding to the key, so as to serve as the target data.

For example, when the vocabulary to be evaluated is "description", the vocabulary "description" may be used as a key, and all keys in the encoded hash table are sequentially traversed to query the encoded data corresponding to the vocabulary "description", and the encoded data corresponding to the vocabulary "description" may be used as the target data, and the target data may be in the form of [0,0.23,0,0,0 ].

In this optional embodiment, the similarity between the target data and the central data of each semantic group may be used as the similarity between the target data and the semantic group, and the semantic group corresponding to the minimum similarity may be selected as the target group.

In this alternative embodiment, the euclidean distance between the target data and the central data of each semantic group may be calculated, and the reciprocal of the euclidean distance is used as the similarity, and the higher the similarity is, the more similar the target data and the central data of the semantic group are, the higher the similarity is between the vocabulary represented by the target data and the vocabulary in the semantic group is.

In this optional embodiment, if the target data is a and the semantic group center data is B, the similarity calculation method includes:

wherein S represents the similarity between the target data A and the semantic group central data B, and the higher the value of S is, the higher the similarity between the vocabulary represented by the target data A and the vocabulary represented by the central data B is; i represents the dimension index of the target data and the central data, n represents the dimension number of the target data and the central data, A _i A value representing the ith dimension of the target data, B _i A value representing the ith dimension in the semantic group center data.

Illustratively, when the target data is [0,0.23,0,0,0] and the semantic group center data is [0,0.45,0,0,0], the similarity is calculated by:

in this optional embodiment, the semantic group corresponding to the maximum similarity may be used as the target group, for example, if there are 4 semantic groups, and the similarities between the target data and the central data of each semantic group are 1, 2,3, and 4, respectively, the semantic group corresponding to the similarity 4 may be used as the target group.

Therefore, target data are obtained by inquiring coded data corresponding to the vocabulary to be evaluated, a target group is selected according to the similarity between the target data and the central data of the semantic group, the similarity between the data in the target group and the target data is higher, and the accuracy of vocabulary association can be improved.

S14, classifying the coded data in the target group to obtain a plurality of candidate groups, wherein each candidate group comprises a plurality of coded data.

In an optional embodiment, the classifying the data in the target group to obtain a plurality of candidate groups includes:

a, classifying the coded data in the target group according to a preset radius threshold and a preset density threshold to obtain a classification result, wherein the classification result comprises core data and outlier data; in this optional embodiment, taking any one of the encoded data in the target group as an example, when the encoded data is taken as a center and the number of the encoded data included in a range with the radius threshold as a radius is not less than the density threshold, the encoded data is taken as core data, otherwise, the encoded data is outlier data, in this scheme, the radius threshold may be 0.1 and the density threshold may be 4, for example, if the number of the encoded data in a range with the encoded data T as a center and with the radius threshold of 0.1 as a radius is not less than 4, the encoded data T is taken as core data;

b, setting a discrimination condition according to the radius threshold, wherein the discrimination condition means that the relationship between two coded data meets any one of preset relationships, and the preset relationships comprise: the density is direct, can reach and is connected. The direct density means that the Euclidean distance between two coded data is not less than the radius threshold; the density reach means that the relation between two coded data is not density reach and the two coded data have common density reach coded data; the density concatenation refers to encoded data in which the relationship between two encoded data is not density reachable and has a common density reachable. When the preset radius threshold value is 0.5, the Euclidean distance between the coded data E and the coded data F is 0.4, and the relation between the coded data E and the coded data F is that the density is direct; if the Euclidean distance between the encoded data G and the encoded data F is 0.4, the density of the relationship between the encoded data G and the encoded data F is direct, and if the Euclidean distance between the encoded data G and the encoded data E is more than 0.5, the density of the relationship between the encoded data G and the encoded data E is reachable; when the relation between the encoded data H and the encoded data E is not density-reachable but the relation between the encoded data H and the encoded data G is density-reachable, the relation between the encoded data H and the encoded data E is density-connected.

c, marking all core data as 'unaccessed'; in this optional embodiment, the encoded data E, the encoded data F, the encoded data G, and the encoded data H may all be set as core data, and then the encoded data E, the encoded data F, the encoded data G, and the encoded data H may all be marked as "unaccessed";

d, optionally selecting one core data marked as 'unaccessed' as target data;

e, sequentially traversing all core data labeled as 'unaccessed', if the traversed core data meet the judgment condition, namely meet any one of the preset relations, classifying the core data and the target data into the same candidate group, and marking all the core data in the candidate group as 'accessed'; in this optional embodiment, taking the encoded data E as target data, taking traversed encoded data H as an example, because the relationship between the encoded data H and the target data E is density connection, the encoded data H meets the criterion, and the encoded data H and the target data E are classified into the same candidate group, and all the encoded data in the candidate group are marked as "visited";

and f, repeating the steps d and e to obtain a plurality of candidate groups until all the coded data in the target group are marked as accessed, finishing the classification and obtaining a plurality of candidate groups, and taking all the outlier data as one candidate group.

Therefore, a plurality of candidate groups are obtained by further classifying the coded data in the target group, each candidate group comprises a plurality of coded data, each coded data represents a vocabulary in the natural language database, the coded data in each candidate group is used for representing vocabularies belonging to the same class, the vocabulary classification result can be refined, and the vocabulary association accuracy can be further improved.

And S15, calculating a correlation index of each candidate group, wherein the correlation index is used for representing the recommended degree of the coded data in each candidate group.

In an alternative embodiment, the calculating the association index of each candidate group includes:

In this alternative embodiment, a mean value of each candidate group may be calculated, and the encoded data with the minimum euclidean distance from the mean value in each candidate group is taken as the centroid of each candidate group, and for example, when a certain candidate group includes 4 encoded data and is [0,0.23,0,0], [0.45,0,0,0], [0,0,0.65,0], and [0,0, 0.55], the mean value of the candidate group is calculated in the following manner:

then the mean value is [0.11,0.06,0.16,0.13], noting that the euclidean distances of the 4 encoded data from the mean value are Dis1, Dis2, Dis3, and Dis4, respectively, then the euclidean distances of the 4 encoded data from the mean value are:

then the code vector 0,0.23,0,0 may be selected as the centroid of the candidate set.

In this alternative embodiment, the variance of the encoded data in each candidate group may be calculated as the aggregation level of each candidate group, and a higher aggregation level indicates that the vocabulary corresponding to the encoded data in the candidate group has a larger influence on the semantics of the vocabulary in the target group, and the vocabulary in the candidate group is more important and should be recommended. The variance is calculated in the following way:

wherein, A represents a certain coded data in the candidate group; b represents the mean of the candidate group; j represents an index of the encoded data in the candidate set; a. the _j Representing the jth coded data in the candidate set; k represents the number of coded data in the candidate set; x represents the variance of the candidate group, namely the aggregation level of the coded data in the candidate group, and the higher the value of x is, the lower the aggregation level of the coded data in the candidate group is, and the higher the influence degree of the vocabulary corresponding to the coded data in the candidate group on the vocabulary semantics is.

For example, if a candidate group includes 4 encoded data [0,0.23,0,0], [0.45,0,0,0], [0,0,0, 0.65,0], and [0,0,0,0.55], and the mean of the encoded data in the candidate group is [0.11,0.06,0.16,0.13], the variance of the encoded data in the candidate group is calculated by:

the variance represents a degree of polymerization of 0.9534.

In this alternative embodiment, a similarity between the target data and the centroid data may be calculated, where the similarity may be an inverse of the euclidean distance and may be denoted as y, and the calculation method is as follows:

wherein y represents the similarity between the target data C and the centroid data D, and a higher value of y indicates a higher degree of similarity between the vocabulary represented by the target data C and the vocabulary represented by the centroid data D; i represents a dimension index of the target data and the centroid data, n represents a number of dimensions of the target data and the center data, C _i A value representing the ith dimension of the target data, D _i A value representing an ith dimension in the centroid data.

Illustratively, when the target data is [0,0.23,0,0,0] and the centroid data is [0,0.45,0,0,0], the similarity is calculated by:

the similarity of the target data and the centroid data is 4.54.

In this optional embodiment, the customized normalization model satisfies the following relation:

wherein, T represents the correlation index, and the higher the value of the correlation index is, the more words corresponding to the coded data in the corresponding candidate group should be recommended; x represents the polymerization degree of the candidate group, and the higher the polymerization degree is, the higher the importance of the vocabulary corresponding to the coded data in the candidate group is; y represents the similarity between the target data and the candidate group, and the higher the similarity is, the more words corresponding to the coded data in the candidate group should be recommended; alpha represents a preset harmonic constant, and can be 5 according to multiple test experiences; e represents a natural constant.

For example, when the harmonic constant is 5, the degree of polymerization of a candidate group is 0.9534, and the similarity between the target data and the centroid data of the candidate group is 4.54, the method for calculating the association index of the candidate group is as follows:

the correlation index corresponding to all the encoded data in the candidate set is 0.408.

Therefore, the relevance index corresponding to each candidate group is obtained by calculating the polymerization degree of each candidate group, respectively calculating the similarity between the centroid of each candidate group and the target data, and inputting the polymerization degree and the similarity into a self-defined normalization model, and the higher the value of the relevance index is, the stronger the relevance degree between the vocabulary represented by the coded data in the corresponding candidate group and the vocabulary represented by the target data is, so that the accuracy of text recommendation is improved.

And S16, sequentially recommending the vocabularies corresponding to the coded data in the candidate group according to the sequence from high to low of the correlation index.

In this alternative embodiment, the correlation index of each candidate group may be compared with a preset threshold value to obtain a comparison result of each candidate group, where the comparison result includes greater than and less than. If the correlation index is greater than a preset threshold, the coded data in the candidate group is retained, and if the correlation index is not greater than the preset threshold, the coded data in the candidate group is deleted, because the output result of the normalization model is a real number between 0 and 1, that is, the value of the correlation index is a real number between 0 and 1, the preset threshold may be 0.5.

In this alternative embodiment, the retained candidate group may be sorted according to the association index, if the higher the association index value of the candidate group is, the earlier the order of the candidate group is, the vocabulary corresponding to the encoded data in the candidate group may be further recommended according to the order of the candidate group, and the earlier the order of the candidate group is, the earlier the vocabulary corresponding to the encoded data in the candidate group is recommended to the user.

Therefore, the candidate groups are sequenced according to the association indexes corresponding to the candidate groups to obtain the sequence of the candidate groups, the vocabularies corresponding to the coded data in the candidate groups are recommended to the user according to the sequence, the vocabularies can be evaluated in a quantification mode, and the accuracy of text recommendation is improved.

Fig. 2 is a functional block diagram of a preferred embodiment of an artificial intelligence based text recommendation apparatus according to an embodiment of the present application. The artificial intelligence based text recommendation device comprises a word segmentation unit 110, a coding unit 111, a first classification unit 112, a first calculation unit 113, a second classification unit 114, a second calculation unit 115 and a recommendation unit 116. The module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and that can perform a fixed function, and that are stored in the memory 12. In the present embodiment, the functions of the modules/units will be described in detail in the following embodiments.

In an alternative embodiment, the segmentation unit 110 is configured to segment the text in the natural language database by using a preset segmentation tool to obtain a plurality of words.

In this alternative embodiment, the preset word segmentation tool may be a jieba word segmentation tool, and the jieba word segmentation tool is a code written by a Python program, and functions to separate words in a natural language text to obtain a plurality of words. Cut (sensor), wherein cut represents word segmentation instructions in the jieba word segmentation tool, and sensor represents the natural language text.

In an optional embodiment, the encoding unit 111 is configured to encode the vocabulary to obtain encoded data corresponding to each vocabulary.

In an alternative embodiment, the first classification unit 112 is configured to classify the encoded data to obtain a plurality of semantic groups, each semantic group containing a plurality of encoded data.

e, calculating the mean value of all the coded data in each vocabulary classification group, taking the vocabulary classification group as a classification result if all the mean values meet a preset judgment condition, and otherwise taking the coded data closest to the mean value in each vocabulary classification group as central data and repeating the steps a to e to obtain the classification result.

the euclidean distance between the current data and the center data is 0.4754.

a3: the current data is labeled as 1 and classified as the vocabulary classification group where the center data corresponding to the smaller Euclidean distance is located, for example, if there are four centersData, and the Euclidean distance between the current data and each central data is respectively recorded as Dis ¹ ＝0.4754、Dis ² ＝0.57、Dis ³ 0.65 and Dis ⁴ 0.32, where the superscript for each euclidean distance represents the index for each center datum. Since the minimum value of the four Euclidean distances is Dis ⁴ 0.32, so the current data is classified as the vocabulary classification group where the 4 th center data is located;

respectively calculating the error of each vocabulary classification group, wherein the error is the square sum of the Euclidean distances between all encoding data in each vocabulary classification group and the central data;

the euclidean distance between the central data and the encoded data is 0.5.

0.5 ² +1 ² +1.5 ² +2 ² ＝7.5

In an alternative embodiment, the first calculating unit 113 calculates the similarity between the encoded data corresponding to the vocabulary to be evaluated and each semantic group, and takes the semantic group corresponding to the maximum similarity as the target group.

In an optional embodiment, the second classifying unit 114 is configured to classify the encoded data in the target group to obtain a plurality of candidate groups, and each candidate group includes a plurality of encoded data.

b, setting a discrimination condition according to the radius threshold, wherein the discrimination condition means that the relationship between two coded data meets any one of preset relationships, and the preset relationships comprise: the density is direct, can reach and is connected. The direct density means that the Euclidean distance between two coded data is not less than the radius threshold; the density reach refers to coded data of which the relation between two coded data is not density reach and the two coded data have common density reach; the density concatenation refers to encoded data in which the relationship between two encoded data is not density reachable and has a common density reachable. When the preset radius threshold value is 0.5, the Euclidean distance between the coded data E and the coded data F is 0.4, and the relation between the coded data E and the coded data F is that the density is direct; if the Euclidean distance between the encoded data G and the encoded data F is 0.4, the density of the relationship between the encoded data G and the encoded data F is direct, and if the Euclidean distance between the encoded data G and the encoded data E is more than 0.5, the density of the relationship between the encoded data G and the encoded data E is reachable; when the relation between the encoded data H and the encoded data E is not density-reachable but the relation between the encoded data H and the encoded data G is density-reachable, the relation between the encoded data H and the encoded data E is density-connected.

d, optionally selecting one core data marked as 'unaccessed' as target data;

In an alternative embodiment, the second calculating unit 115 is configured to calculate a correlation indicator for each candidate set, the correlation indicator being used to represent the degree to which the encoded data in each candidate set should be recommended.

the variance represents a degree of polymerization of 0.9534.

wherein y represents the similarity between the target data C and the centroid data D, and a higher value of y indicates the vocabulary and the centroid represented by the target data CThe higher the degree of similarity between the words represented by the data D; i represents a dimension index of the target data and the centroid data, n represents a number of dimensions of the target data and the center data, C _i A value representing the ith dimension of the target data, D _i A value representing an ith dimension in the centroid data.

the similarity of the target data and the centroid data is 4.54.

In an alternative embodiment, the recommending unit 116 is configured to recommend the words corresponding to the encoded data in the candidate group in order from high to low according to the relevance indicator.

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1 comprises a memory 12 and a processor 13. The memory 12 is used for storing computer readable instructions, and the processor 13 is used for executing the computer readable instructions stored in the memory to implement the artificial intelligence based text recommendation method of any of the above embodiments.

In an alternative embodiment, the electronic device 1 further comprises a bus, a computer program, such as an artificial intelligence based text recommendation program, stored in the memory 12 and executable on the processor 13.

Fig. 3 only shows the electronic device 1 with components 12-13, and it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.

In conjunction with fig. 1, memory 12 in electronic device 1 stores a plurality of computer-readable instructions to implement an artificial intelligence based text recommendation method, and processor 13 may execute the plurality of instructions to implement:

coding the vocabulary in the natural language database to obtain coded data corresponding to each vocabulary;

Specifically, the specific implementation method of the instruction by the processor 13 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.

It will be understood by those skilled in the art that the schematic diagram is merely an example of the electronic device 1, and does not constitute a limitation to the electronic device 1, the electronic device 1 may have a bus-type structure or a star-type structure, and the electronic device 1 may further include more or less hardware or software than those shown in the figures, or different component arrangements, for example, the electronic device 1 may further include an input and output device, a network access device, etc.

It should be noted that the electronic device 1 is only an example, and other existing or future electronic products, such as those that may be adapted to the present application, should also be included in the scope of protection of the present application, and are included by reference.

Memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, removable hard disks, multimedia cards, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. The memory 12 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (FlashCard), and the like, provided on the electronic device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of an artificial intelligence based text recommendation program, etc., but also to temporarily store data that has been output or is to be output.

The processor 13 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 13 is a control unit (control unit) of the electronic device 1, connects various components of the whole electronic device 1 by various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 12 (for example, executing a text recommendation program based on artificial intelligence, etc.), and calling data stored in the memory 12.

The processor 13 executes an operating system of the electronic device 1 and various types of application programs installed. The processor 13 executes the application program to implement the steps in the various artificial intelligence based text recommendation method embodiments described above, such as the steps shown in fig. 1.

Illustratively, the computer program may be partitioned into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to complete the application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the electronic device 1. For example, the computer program may be segmented into a participle unit 110, an encoding unit 111, a first classification unit 112, a first calculation unit 113, a second classification unit 114, a second calculation unit 115, a recommendation unit 116.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the artificial intelligence based text recommendation method according to the embodiments of the present application.

The integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the processes in the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and executed by a processor, to implement the steps of the embodiments of the methods described above.

Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer memory, Read-only memory (ROM), random access memory and other memory, etc.

The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 3, but this does not indicate only one bus or one type of bus. The bus is arranged to enable connected communication between the memory 12 and at least one processor 13 or the like.

The present application further provides a computer-readable storage medium (not shown), in which computer-readable instructions are stored, and the computer-readable instructions are executed by a processor in an electronic device to implement the artificial intelligence based text recommendation method according to any of the above embodiments.

In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims

1. A text recommendation method based on artificial intelligence, characterized in that the method comprises:

coding the vocabularies to obtain coded data corresponding to each vocabulary;

2. The artificial intelligence based text recommendation method of claim 1, wherein said classifying the encoded data to obtain a plurality of semantic groups comprises:

constructing a center list according to a preset upper classification limit, wherein each element in the center list is a positive integer, and selecting a plurality of coded data as center data according to each element in the center list;

3. The artificial intelligence based text recommendation method of claim 2, wherein said classifying all encoded data a plurality of times in accordance with the central data to obtain a plurality of classification results, each classification result containing a plurality of lexical classification groups comprises:

d, randomly selecting one piece of coded data marked as 0 as current data again, repeating the steps b to d until all the coded data are marked as 1, stopping iteration and obtaining a plurality of vocabulary classification groups;

4. The artificial intelligence based text recommendation method of claim 2, wherein the calculating a sum of squared errors for each classification result as an evaluation value and calculating a difference of the evaluation values to construct an evaluation result hash table comprises:

5. The artificial intelligence based text recommendation method of claim 1, wherein the classifying the encoded data in the target group to obtain a plurality of candidate groups comprises:

a, classifying the coded data in the target group according to a preset radius threshold and a preset density threshold to obtain a classification result, wherein the classification result comprises core data and outlier data;

b, setting a judgment condition according to the radius threshold, wherein the judgment condition means that the relation between the two data meets any one of preset relations, and the preset relations comprise: the density direct refers to that the Euclidean distance between two coded data is not less than the radius threshold value; the density reach means that the relation between two coded data is not density reach and the two coded data have common density reach coded data; the density connection refers to that the relation between two coded data is not density reachable coded data and has common density reachable coded data;

c, marking all core data as 'unaccessed';

d, optionally selecting one core data marked as 'unaccessed' as target data;

6. The artificial intelligence based text recommendation method of claim 1, wherein said calculating an association indicator for each candidate set comprises:

7. The artificial intelligence based text recommendation method of claim 6, wherein the customized normalization model satisfies a relation:

8. An artificial intelligence based text recommendation apparatus, the apparatus comprising:

9. An electronic device, characterized in that the electronic device comprises:

a memory storing computer readable instructions; and

a processor executing computer readable instructions stored in the memory to implement the artificial intelligence based text recommendation method of any of claims 1-7.

10. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein computer-readable instructions that are executed by a processor in an electronic device to implement the artificial intelligence based text recommendation method of any of claims 1-7.