WO2023071120A1 - Method for recognizing proportion of green assets in digital assets and related product - Google Patents

Method for recognizing proportion of green assets in digital assets and related product

Info

Publication number
WO2023071120A1
WO2023071120A1 PCT/CN2022/090224 CN2022090224W WO2023071120A1 WO 2023071120 A1 WO2023071120 A1 WO 2023071120A1 CN 2022090224 W CN2022090224 W CN 2022090224W WO 2023071120 A1 WO2023071120 A1 WO 2023071120A1
Authority
WO
WIPO (PCT)
Prior art keywords
assets
text
proportion
digital
digital assets
Application number
PCT/CN2022/090224
Other languages
French (fr)
Chinese (zh)
Inventor
诸世卓
崔伟旗
刘琛
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2023071120A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence technology, and specifically to a method, and related products, for identifying the proportion of green assets in digital assets.
  • The embodiments of the present application provide a method, and related products, for identifying the proportion of green assets in digital assets, so as to improve the accuracy with which that proportion is identified.
  • An embodiment of the present application provides a method, based on text recognition, for identifying the proportion of green assets in digital assets, including: performing text recognition on the acquired position data of the digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the asset information of each first digital asset is disclosed in the position data and the asset information of the second digital asset is not; obtaining the disclosure data of each first digital asset according to its asset information, and inputting that disclosure data into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of that first digital asset; determining, according to the similarity model, the similarity between each first text segment and a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes; according to the similarity between each of the first text segments and the multiple second text segments, the
  • An embodiment of the present application provides a device for identifying the proportion of green assets, including an acquisition unit and a processing unit. The acquisition unit is used to acquire the position data of the digital asset to be identified. The processing unit is used to perform text recognition on the acquired position data to obtain a plurality of first digital assets and a second digital asset, wherein the asset information of each first digital asset is disclosed in the position data and the asset information of the second digital asset is not. The acquisition unit is further configured to acquire the disclosure data of each first digital asset according to its asset information. The processing unit is further configured to input the disclosure data of each first digital asset into the machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of that first digital asset; and to determine, according to the similarity model, the similarity between each first text segment and a plurality of second text segments, wherein the plurality of second text segments are used to describe
  • An embodiment of the present application provides an electronic device including a processor and a memory connected to the processor; the memory stores a computer program, and the processor executes the computer program stored in the memory to cause the electronic device to perform the following steps:
  • according to the asset information of each first digital asset, obtain the disclosure data of each first digital asset, and input that disclosure data into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of that first digital asset;
  • according to the similarity model, determine the similarity between each first text segment and a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
  • according to the profile of the manager of the digital asset to be identified, obtain all the digital assets managed by the manager, compute the average proportion of green assets among those of them whose asset information is disclosed, and take that average as the proportion of green assets in the second digital asset;
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes a computer to perform the following steps:
  • according to the asset information of each first digital asset, obtain the disclosure data of each first digital asset, and input that disclosure data into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of that first digital asset;
  • according to the similarity model, determine the similarity between each first text segment and a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
  • according to the profile of the manager of the digital asset to be identified, obtain all the digital assets managed by the manager, compute the average proportion of green assets among those of them whose asset information is disclosed, and take that average as the proportion of green assets in the second digital asset;
  • An embodiment of the present application provides a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method described in the first aspect.
  • FIG. 1 is a schematic flowchart of a method for identifying the proportion of green assets in digital assets based on text recognition provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the position data of a fund provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a method for identifying the proportion of green assets in stocks provided by an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of a similarity model training method provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a method for identifying the proportion of green assets in a bond provided by an embodiment of the present application;
  • FIG. 6 is a block diagram of the functional units of a device for identifying the proportion of green assets provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • The application scenario of this application is identifying the green assets in a fund. Therefore, the digital assets to be identified in this application are funds to be identified.
  • A fund is generally composed of multiple stocks, multiple bonds, and other fixed-income assets.
  • For stocks, the asset information of each stock, such as its name, proportion, and net value, is fully disclosed; for bonds, however, not all asset information is disclosed. For example, only some bonds disclose their name, proportion, net value, and so on.
  • The digital assets whose asset information is disclosed, namely the multiple stocks and the multiple disclosed bonds, are called the first digital assets;
  • the digital assets whose asset information is not disclosed, that is, the undisclosed bonds, are called the second digital asset.
  • FIG. 1 is a schematic flowchart of a method for identifying the proportion of green assets in digital assets based on text recognition provided by an embodiment of the present application.
  • the method is applied to the identification device of the proportion of green assets.
  • the method includes the following steps:
  • the position data of the digital asset to be identified is acquired from the platform of the issuing company of the digital asset to be identified or from the third-party management platform of the digital asset to be identified by crawler technology.
  • The second digital asset is included in the digital asset to be identified by default, so there is no need to perform text recognition on the position data to identify it.
  • 103: Obtain the disclosure data of each first digital asset according to the asset information of each first digital asset, and input the disclosure data into the machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of that first digital asset.
  • The asset name of each first digital asset is obtained from the position data, and the disclosure data of each first digital asset is then retrieved by crawler technology based on the asset name.
  • When the first digital asset is a stock, its disclosure data is the disclosure document of the stock, that is, the annual report issued by the company to which the stock belongs, and the fund distribution described by a first text segment is the proportion of a sub-product of that company; text segmentation is performed on the annual report with the machine reading comprehension model to obtain the at least one first text segment.
  • How the annual report is segmented and how the proportion of green assets in a stock is obtained are described in detail later.
  • When the first digital asset is a bond, its disclosure data is the disclosure data of the bond, that is, the data disclosed by the bond issuer when the bond was raised, and the fund distribution described by a first text segment is the use of the funds raised by the bond. The at least one first text segment is therefore obtained by performing text segmentation on the bond disclosure data with the machine reading comprehension model. How the disclosure data is segmented and how the proportion of green assets in a disclosed bond is obtained are described in detail later.
  • According to the similarity model, determine the similarity between each first text segment and multiple second text segments, where the multiple second text segments describe multiple fund distributions with green attributes.
  • The multiple fund distributions described by the multiple second text segments may be multiple industries with green attributes, referred to as multiple first industries.
  • Alternatively, the multiple fund distributions described by the multiple second text segments may be multiple fund uses with green attributes.
  • The proportion of the sub-product described by the target first text segment is used as the proportion of green assets in each first digital asset.
  • The proportion of the funds planned for the fund use described by the target first text segment, relative to the total amount of the first digital asset, is taken as the proportion of green assets in the first digital asset.
  • Text recognition can be performed on the position data to obtain the first ratio of each first digital asset; then, according to the first ratio of each first digital asset and its proportion of green assets, the first proportion of the green assets in each first digital asset relative to the net value of the digital asset to be identified is determined.
  • the first proportion of each first digital asset can be expressed by formula (1):
  • the position data will disclose the total ratio of each first digital asset relative to the net value of the digital asset to be identified. Therefore, the second ratio of the net value of the second digital asset relative to the net value of the digital asset to be identified can be determined according to the position data and the first ratio of each first digital asset.
  • the second ratio of the second digital asset can be expressed by formula (2):
  • HP_{b2} is the second ratio of the net value of the second digital asset relative to the net value of the digital asset to be identified; the term indexed by i is the ratio of the net value of the i-th first digital asset to the net value of the digital asset to be identified; and m is the number of first digital assets.
  • According to the second ratio of the second digital asset and its proportion of green assets, the second proportion of the green assets in the second digital asset relative to the net value of the digital asset to be identified is determined.
  • the second proportion of the second digital asset can be expressed by formula (3):
  • FG_{b2} is the second proportion of the second digital asset; the other symbol in the formula denotes the proportion of green assets in the second digital asset.
  • the first proportion of each first digital asset and the second proportion of the second digital asset are summed to obtain the proportion of green assets among the digital assets to be identified.
  • the proportion of green assets in digital assets to be identified can be expressed by formula (4):
  • FG is the proportion of green assets among the digital assets to be identified.
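  • The aggregation described for formulas (1), (3) and (4) can be illustrated with a minimal Python sketch. The formula images are not available here, so the sketch assumes that each asset's contribution is its net-value ratio multiplied by its green share, and that the second digital asset's green share is the manager-level average described earlier. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DisclosedAsset:
    """A first digital asset (stock or disclosed bond); field names are hypothetical."""
    name: str
    nav_ratio: float    # net value of the asset / net value of the fund
    green_share: float  # proportion of green assets inside the asset


def manager_average_green_share(disclosed_green_shares: Sequence[float]) -> float:
    """Average green share over the manager's disclosed assets.

    This average stands in for the green share of the undisclosed
    (second) digital asset.
    """
    if not disclosed_green_shares:
        return 0.0  # assumption: no disclosed assets gives no evidence of green holdings
    return sum(disclosed_green_shares) / len(disclosed_green_shares)


def fund_green_proportion(
    first_assets: Sequence[DisclosedAsset],
    second_nav_ratio: float,
    second_green_share: float,
) -> float:
    """Sum the green contributions of all holdings, relative to the fund's net value.

    Assumed relationship: contribution = net-value ratio * green share.
    """
    first_total = sum(a.nav_ratio * a.green_share for a in first_assets)
    return first_total + second_nav_ratio * second_green_share


holdings = [
    DisclosedAsset("stock A", nav_ratio=0.4, green_share=0.5),
    DisclosedAsset("bond B", nav_ratio=0.2, green_share=1.0),
]
undisclosed_share = manager_average_green_share([0.5, 1.0, 0.0])
print(fund_green_proportion(holdings, second_nav_ratio=0.1,
                            second_green_share=undisclosed_share))
```

  • The second ratio is passed in as a parameter because this section does not show how formula (2) derives it from the position data.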
  • Text recognition can also be performed on the position data to obtain the total amount of a part of the first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified, wherein this part of the first digital assets is the disclosed bonds among the multiple first digital assets. Text recognition is likewise performed on the position data to obtain the total net value of this part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified. The third ratio, Pb_v, is the sum of the total amount of this part of the first digital assets and the total amount of the second digital asset, relative to the total amount of the digital asset to be identified. The fourth ratio, Pb_npv, is the sum of the total net value of this part of the first digital assets and the total net value of the second digital asset, relative to the total net value of the digital asset to be identified. Determine the leverage ratio according to the third ratio and the fourth
  • The leverage ratio is calculated because the bond assets used when computing the proportion of green assets in bonds are bond assets after leverage has been added, which makes the computed proportion too high; the impact of leverage therefore has to be removed. According to the leverage ratio, the first proportions of this part of the first digital assets and the second proportion of the second digital asset are each deleveraged to obtain the first target proportions of this part of the first digital assets and the second target proportion of the second digital asset. Finally, the first proportions of the other part of the first digital assets (that is, the stocks among the multiple first digital assets), the first target proportions of this part of the first digital assets, and the second target proportion of the second digital asset are summed to obtain the proportion of green assets in the digital asset to be identified.
  • the green ratio of digital assets to be identified can be expressed by formula (5):
  • m_1 is the number of first digital assets in the other part (the stocks), m_2 is the number of first digital assets in the first part (the disclosed bonds), and m_1 + m_2 = m.
  • the digital asset to be identified is any one of the multiple digital assets to be identified held by the investment institution at time t, that is, any one of the multiple funds held by the investment institution .
  • the proportion of green assets in each digital asset to be identified among the plurality of digital assets to be identified may be determined.
  • the green scale of each digital asset to be identified held by an investment institution can be expressed by formula (6):
  • S_i is the green scale of the i-th digital asset to be identified held by the investment institution
  • FG_i is the proportion of green assets in the i-th digital asset to be identified
  • V_i is described
  • R_i is the share of the i-th digital asset to be identified held by the investment institution at time t.
  • FIG. 3 is a schematic flowchart of a method for identifying the proportion of green assets in stocks provided by an embodiment of the present application.
  • the content in this embodiment is the same as that in the embodiment shown in FIG. 1 , and will not be described again here.
  • the method of the present embodiment comprises the following steps:
  • 301: Perform text recognition on the disclosure document of each first digital asset to obtain the target chapter in the disclosure document, wherein the target chapter describes the main products of the company to which each first digital asset belongs and includes a target table and a target text segment.
  • The disclosure document is the annual report issued for the first digital asset by the company that issued it.
  • the chapter “I. Overview” in the chapter “Section Four Discussion and Analysis of Business Situation” in the company's annual report is used to describe the company's main products. Therefore, text recognition is performed on the disclosure document, and the chapter “Section 4 Discussion and Analysis of Business Situation” is located; then, text recognition is performed on this chapter to obtain subdivided chapters under this chapter, that is, the chapter "I. Overview", And use this subdivision chapter as the target chapter.
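  • As an illustration only, the two-level lookup above (chapter first, then sub-chapter) could be approximated with regular expressions once the annual report is available as plain text. The application does not state how its text recognition is implemented, so the regex approach, the assumed heading layout (one heading per line, sub-chapters numbered with Roman numerals), and the function name below are all assumptions.

```python
import re


def find_target_chapter(report_text: str) -> str:
    """Return the body of "I. Overview" inside the Section 4 chapter, or "" if absent.

    Hypothetical helper: assumes each heading sits on its own line.
    """
    # Step 1: locate the "Section 4 ..." (or "Section Four ...") chapter heading.
    chapter = re.search(r"^Section\s+(?:4|Four)\b.*$", report_text, flags=re.MULTILINE)
    if chapter is None:
        return ""
    rest = report_text[chapter.end():]

    # The chapter ends where the next "Section N" heading begins.
    next_chapter = re.search(r"^Section\s+\S+", rest, flags=re.MULTILINE)
    chapter_body = rest[:next_chapter.start()] if next_chapter else rest

    # Step 2: inside the chapter, locate the sub-chapter "I. Overview".
    overview = re.search(r"^I\.\s*Overview\s*$", chapter_body, flags=re.MULTILINE)
    if overview is None:
        return ""
    after = chapter_body[overview.end():]

    # The sub-chapter ends at the next Roman-numeral heading such as "II. ...".
    next_sub = re.search(r"^[IVX]+\.\s", after, flags=re.MULTILINE)
    return (after[:next_sub.start()] if next_sub else after).strip()


sample_report = (
    "Section 3 Company Profile\n"
    "Some profile text.\n"
    "Section 4 Discussion and Analysis of Business Situation\n"
    "I. Overview\n"
    "The main product is new energy batteries.\n"
    "II. Core competitiveness\n"
    "Other text.\n"
    "Section 5 Other\n"
)
print(find_target_chapter(sample_report))
```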
  • The target chapter includes a target table and a target text segment, wherein the target text segment describes the main products of the enterprise, and the target table describes the main products and, for each, its turnover relative to the total turnover of the enterprise, that is, the proportion of the main product.
  • Entity recognition is performed on the target text segment to obtain the product-related entities, and the products corresponding to those entities are used as the main products of the enterprise.
  • the target text segment describes that the main product of the affiliated enterprise is "new energy battery”
  • The machine reading comprehension (MRC) model is pre-trained, and this application does not describe the process of training the MRC model.
  • First, the question for the MRC model is set as: "Which products are the sub-products (i.e.
  • the main product of the MRC model, and the article input to the MRC model is set to the target text segment; then, the question is encoded by the encoding layer of the MRC model to obtain a first vector, and each sub-text segment in the target text segment is encoded to obtain a second vector corresponding to that sub-text segment; the first vector and the second vector of each sub-text segment are then input to the interaction layer of the MRC model to obtain the similarity between the question and each sub-text segment, and the sub-text segments whose similarity is greater than a preset threshold are used as the at least one first text segment.
  • At least one sub-product under the main product can be obtained.
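  • The selection step (encode the question, encode each sub-text segment, score each pair, keep those above a threshold) can be sketched as follows. The bag-of-words encoder and cosine score are toy stand-ins, not the MRC model's actual encoding and interaction layers, and the function names and threshold value are hypothetical.

```python
import math
from collections import Counter
from typing import List


def encode(text: str) -> Counter:
    """Toy encoder: bag-of-words counts (a stand-in for the MRC encoding layer)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (a stand-in for the interaction layer)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_first_text_segments(question: str, sub_segments: List[str],
                               threshold: float) -> List[str]:
    """Keep the sub-text segments whose similarity to the question exceeds the threshold."""
    question_vector = encode(question)
    return [s for s in sub_segments if cosine(question_vector, encode(s)) > threshold]


question = "which products are sub-products of new energy battery"
segments = [
    "lithium battery is a new energy battery product",
    "the company held its annual meeting",
]
print(select_first_text_segments(question, segments, threshold=0.3))
```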
  • the target text segment may describe multiple main products and sub-products under each main product.
  • For example, if the main products described include “new energy battery” and “wind power generation”, then for the main product “new energy battery”, after the target text segment is input into the MRC model, the output first text segments are the text segments that describe batteries; for example, the identified first text segments may respectively describe “lithium battery”, “nuclear battery”, and other new energy batteries.
  • the proportion of the main product can be evenly split to the at least one sub-product, to obtain the proportion of each sub-product in the at least one sub-product.
  • the sub-product can be further split, and the proportion of the sub-product can be split to finer-grained products.
  • the main product is split once as an example, and multiple splits are not performed.
  • the proportion of main product A is 50%, and the main product A includes sub-product b and sub-product c, then the proportion of sub-product b and sub-product c are both 25%. Further, if sub-product b includes sub-product d and sub-product e, the proportion of sub-product b can be divided equally, and the proportions of sub-product d and sub-product e are 12.5% and 12.5% respectively.
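  • The even split in this example can be written as a short recursive sketch. The product tree is passed in as a plain dictionary, and the function name is hypothetical.

```python
from typing import Dict, List


def split_evenly(product: str, proportion: float,
                 children: Dict[str, List[str]]) -> Dict[str, float]:
    """Distribute a product's proportion evenly over its sub-products, recursively.

    Products without sub-products keep the proportion assigned to them.
    """
    subs = children.get(product, [])
    if not subs:
        return {product: proportion}
    share = proportion / len(subs)
    result: Dict[str, float] = {}
    for sub in subs:
        result.update(split_evenly(sub, share, children))
    return result


# Main product A (50%) has sub-products b and c; b in turn has d and e.
tree = {"A": ["b", "c"], "b": ["d", "e"]}
print(split_evenly("A", 0.5, tree))  # d and e get 0.125 each, c gets 0.25
```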
  • According to the similarity model, determine the similarity between each first text segment and multiple second text segments, wherein the products described by the multiple second text segments are products with a green attribute.
  • A first preset document is obtained; for example, the first preset document may be the "Explanation of the Green Industry Guidance Catalog", and the products recorded in the first preset document all have green attributes;
  • entity recognition is performed on the first preset document to obtain the industries (that is, the products) recorded in it;
  • each product read in this way is regarded as a product with green attributes.
  • The first preset document may not record a product with green attributes directly, but instead record it by reference to another document. Therefore, text recognition is first performed on the first preset document to obtain a plurality of third text segments, which describe the products recorded in the first preset document; a given third text segment, however, may refer to another document that describes the product rather than describing the product itself.
  • any third text segment in multiple third text segments refers to other documents
  • text recognition is performed on other documents to obtain a fourth text segment corresponding to the third text segment, wherein the fourth text segment is other documents
  • the similarity model is obtained by training multiple pairs of target training samples constructed in advance.
  • the process of constructing multiple pairs of target training samples and the model training process will be described in detail later, and no further description will be given here.
  • the similarity model may be a RoFormer model.
  • For each first text segment, the maximum of its similarities to the multiple second text segments is determined; if this maximum similarity is greater than the similarity threshold, the first text segment is used as a target first text segment, that is, the sub-product it describes is determined to be the product with the green attribute described by the second text segment corresponding to the maximum similarity.
  • The proportion of the sub-product described by the target first text segment is used as the proportion of green assets in each first digital asset.
  • There may be one or more target first text segments; that is, one or more of the at least one sub-product may have a green attribute.
  • When there are multiple target first text segments, the proportions of the sub-products they describe are summed, and the sum is used as the proportion of green assets in each first digital asset.
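  • A minimal sketch of this selection and summation, assuming the similarities between each first text segment and the second text segments have already been computed by the similarity model. The function name, the example scores, and the threshold value are hypothetical.

```python
from typing import Dict, List


def green_share_of_asset(similarities: Dict[str, List[float]],
                         proportions: Dict[str, float],
                         threshold: float) -> float:
    """Sum the proportions of sub-products whose best match exceeds the threshold.

    similarities maps each first text segment (keyed here by its sub-product)
    to its similarity scores against all second text segments.
    """
    total = 0.0
    for sub_product, scores in similarities.items():
        # A sub-product counts as green if its maximum similarity is high enough.
        if scores and max(scores) > threshold:
            total += proportions[sub_product]
    return total


scores = {"lithium battery": [0.2, 0.9], "nuclear battery": [0.1, 0.3]}
shares = {"lithium battery": 0.25, "nuclear battery": 0.25}
print(green_share_of_asset(scores, shares, threshold=0.8))
```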
  • FIG. 4 is a schematic flowchart of a similarity model training method provided by an embodiment of the present application.
  • the content in this embodiment is the same as that in the embodiment shown in FIG. 3 , and will not be described again here.
  • the method of the present embodiment comprises the following steps:
  • the products recorded in the second preset document include products with green attributes and products with non-green attributes.
  • The second preset document is obtained through crawler technology; for example, the second preset document may be the "2017 National Economic Industry Classification Catalog 2021 Revised First Edition". All products currently on the market are recorded in the second preset document, so it records both products with green attributes and products with non-green attributes.
  • 402: Perform text recognition on the second preset document to obtain multiple fifth text segments, where the multiple fifth text segments describe the products recorded in the second preset document.
  • Entity recognition is performed on the second preset document to obtain each product recorded in it; the text segments describing each product are then extracted from the second preset document through text recognition to obtain the multiple fifth text segments.
  • 403: Construct multiple pairs of target training samples according to the multiple fifth text segments and the multiple second text segments.
  • Synonym replacement is performed on the entities in each of the multiple second text segments to obtain a sixth text segment corresponding to each second text segment; each second text segment and its corresponding sixth text segment are then used as a pair of training samples, giving multiple pairs of first training samples.
  • multiple pairs of first training samples may also be referred to as multiple pairs of similar samples.
  • A plurality of target fifth text segments are removed from the plurality of fifth text segments to obtain a plurality of seventh text segments, wherein the products described by the target fifth text segments are the same as the products described by the plurality of second text segments, and the target fifth text segments are in one-to-one correspondence with the second text segments.
  • That is, the set difference of the multiple fifth text segments and the multiple second text segments gives the multiple seventh text segments.
  • The difference set referred to in this application is essentially a difference set over the industries described by the text segments; that is, the target fifth text segments are removed from the multiple fifth text segments to obtain the multiple seventh text segments.
  • The products described by the resulting seventh text segments are all products with non-green attributes.
  • A seventh text segment may describe a product of the same kind as the product described by a second text segment, where the product described by the seventh text segment has a non-green attribute while the product described by the second text segment has a green attribute.
  • For example, the product described by the second text segment is “energy-saving industrial boiler” and the product described by the seventh text segment is “industrial boiler”. Both text segments describe boilers, but “energy-saving industrial boilers” have green attributes while “industrial boilers” have non-green attributes, so these two text segments can be used as a pair of training samples. The seventh text segment and its corresponding second text segment are therefore used as a pair of training samples, giving multiple pairs of second training samples. In this application, multiple pairs of second training samples may be referred to as multiple pairs of dissimilar samples.
  • multiple pairs of first training samples and multiple pairs of second training samples are used as the multiple pairs of target training samples.
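  • The sample construction can be sketched as below. Two simplifications are assumed: text segments are compared by exact string match (the application compares the products they describe), and the synonym table and the mapping from a non-green segment to its green counterpart are supplied by hand. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (segment, segment, 1 = similar / 0 = dissimilar)


def build_similar_pairs(second_segments: List[str],
                        synonyms: Dict[str, str]) -> List[Pair]:
    """Pair each green segment with its synonym-replaced variant (sixth text segment)."""
    pairs: List[Pair] = []
    for segment in second_segments:
        rewritten = segment
        for entity, synonym in synonyms.items():
            rewritten = rewritten.replace(entity, synonym)
        if rewritten != segment:  # skip segments with nothing to replace
            pairs.append((segment, rewritten, 1))
    return pairs


def build_dissimilar_pairs(fifth_segments: List[str],
                           second_segments: List[str],
                           counterpart: Dict[str, str]) -> List[Pair]:
    """Pair each non-green segment with the green segment of the same kind."""
    green = set(second_segments)
    # Set difference: fifth segments minus green segments gives the seventh segments.
    seventh_segments = [s for s in fifth_segments if s not in green]
    pairs: List[Pair] = []
    for segment in seventh_segments:
        match = counterpart.get(segment)
        if match in green:
            pairs.append((segment, match, 0))
    return pairs


second = ["energy-saving industrial boiler"]
fifth = ["energy-saving industrial boiler", "industrial boiler"]
similar = build_similar_pairs(second, {"boiler": "steam generator"})
dissimilar = build_dissimilar_pairs(
    fifth, second, {"industrial boiler": "energy-saving industrial boiler"})
target_training_samples = similar + dissimilar
print(target_training_samples)
```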
  • each training sample in each pair of target training samples among the multiple pairs of target training samples is input into the initial model to obtain a feature vector of each training sample, wherein the feature vector is used to determine the probability that the described product has a green attribute; then, according to the feature vector of each training sample and the label of each training sample, the first loss corresponding to each training sample is determined, wherein the label of each training sample identifies the ground truth of whether the product described by that training sample has the green attribute.
  • for similar samples, the labels of the two training samples in each pair are the same; for dissimilar samples, the labels of the two training samples in each pair are different.
  • according to the feature vector of each training sample, the classifier of the initial model determines the probability that the product described by each training sample has the green attribute; the first loss corresponding to each training sample is then determined according to this probability and the label of the training sample.
  • the second loss of each pair of target training samples is determined as follows: according to the feature vectors of the two training samples in each pair of target training samples, the degree of similarity between the two training samples is determined, and this degree of similarity is used as the second loss of the pair.
  • the initial model is trained to obtain the similarity model.
  • the first target loss of the initial model in the process of classifying the green attributes is determined.
  • weighted summation is performed on the first losses of all training samples in multiple pairs of target training samples to obtain the first target loss.
  • the first target loss can be expressed by formula (7):
  • $L_1$ is the first target loss
  • avg is the averaging operation
  • $n$ is the number of pairs of first training samples
  • $m$ is the number of pairs of second training samples
  • $W$ is the weight of the classifier of the initial model
  • $f_t'$ is the t-th training sample among all $2(n+m)$ training samples in the multiple pairs of target training samples
  • $l_t$ is the label of the t-th training sample.
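  • Formula (7) itself is not reproduced in this text, so the sketch below only illustrates the averaging described above. It assumes that the per-sample first loss is a binary cross-entropy between the classifier probability and the label; that choice, the function name and the optional `weights` argument are assumptions, not details taken from the application.

```python
import math
from typing import Optional, Sequence


def first_target_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Average the per-sample classification losses over all 2(n+m) samples."""
    eps = 1e-12
    if weights is None:
        weights = [1.0] * len(probs)
    losses = []
    for p, label, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        # Assumed per-sample loss: binary cross-entropy.
        loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
        losses.append(w * loss)
    return sum(losses) / len(losses)


# Two samples predicted at 0.5: each contributes log(2).
print(first_target_loss([0.5, 0.5], [1, 0]))
```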
  • the loss of the initial model in the process of feature extraction for each pair of first training samples is determined to obtain the second target loss.
  • the second loss of each pair of first training samples is obtained, and the second loss of multiple pairs of first training samples is averaged to obtain the second target loss.
  • the second target loss can be expressed by formula (8):
  • $L_{sim}$ is the second target loss
  • avg is the averaging operation
  • $n$ is the number of pairs of first training samples
  • $S_i$ is the i-th pair of first training samples in the $n$ pairs of first training samples
  • $\|\cdot\|_2$ is an operation for calculating the similarity (distance) between the vectors.
  • the loss of the initial model in the process of feature extraction for each pair of second training samples is determined to obtain the third target loss.
  • the second loss of each pair of second training samples is obtained, and the second loss of multiple pairs of second training samples is averaged to obtain the third target loss.
  • the third target loss can be expressed by formula (9):
  • $L_{dissim}$ is the third target loss
  • avg is the averaging operation
  • $m$ is the number of pairs of second training samples
  • $S_j$ is the j-th pair of second training samples in the $m$ pairs of second training samples
  • $\|\cdot\|_2$ is an operation for calculating the similarity (distance) between the vectors.
  • a fourth target loss is determined according to the second target loss and the third target loss.
  • the fourth target loss is expressed by formula (10):
  • $L_4$ is the fourth target loss
  • $k$ is a preset stability parameter, which is used to prevent the fourth target loss $L_4$ from being zero when $L_{sim}$ is 0, thereby preventing model degradation.
  • the loss function of formula (10) is chosen because, given how the training sample pairs are constructed, the second target loss L sim must be optimized towards smaller values while the third target loss L dissim must be optimized towards larger values, so a simple weighted summation cannot unify the two.
  • with the loss function of formula (10), it is sufficient to optimize towards a smaller fourth target loss L 4, which satisfies the optimization requirements of both the second target loss L sim and the third target loss L dissim, and therefore the requirements of the entire backpropagation process.
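  • Formulas (8) to (10) are not reproduced in this text. The sketch below shows one form that is consistent with the description: each pair loss is the L2 distance between the two feature vectors, L sim and L dissim are averages over their pairs, and the fourth target loss is taken as (L sim + k) / L dissim. The ratio form and the default value of `k` are assumptions chosen only so that minimizing the result shrinks L sim, grows L dissim, and stays non-zero when L sim is 0.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_loss(pairs: List[Pair]) -> float:
    """Average distance over a list of (feature vector, feature vector) pairs."""
    return sum(l2_distance(u, v) for u, v in pairs) / len(pairs)


def fourth_target_loss(similar: List[Pair], dissimilar: List[Pair], k: float = 1e-3) -> float:
    l_sim = pair_loss(similar)        # should become small
    l_dissim = pair_loss(dissimilar)  # should become large
    # Assumed combination; undefined if every dissimilar pair has distance 0.
    return (l_sim + k) / l_dissim


similar_pairs = [([0.0, 0.0], [0.0, 0.0])]
dissimilar_pairs = [([0.0, 0.0], [3.0, 4.0])]
print(fourth_target_loss(similar_pairs, dissimilar_pairs))
```

  • Under this assumed form, moving the dissimilar vectors further apart or the similar vectors closer together both reduce the value, which matches the single optimization direction described above.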
  • the fourth target loss and the first target loss are weighted to obtain the final target loss; the initial model is then updated by backpropagation based on the final target loss and the gradient descent method until the initial model converges, so as to obtain the similarity model.
  • when constructing similar training samples, sentence pattern replacement can be performed in addition to synonym replacement.
  • entity recognition is performed on the multiple second text segments to obtain multiple target entities, wherein the multiple target entities are in one-to-one correspondence with the multiple second text segments; that is, the target entities used to describe the plurality of first products are extracted from the multiple second text segments.
  • each second text segment and the target entity extracted from it are used as a pair of training samples to obtain multiple pairs of similar samples, thus constructing similar samples that contain different sentence patterns. For example, if the second text segment is "this bond will be used to repay the loan of the previous hydropower station construction project", then this second text segment and "hydropower station" are used as a pair of similar samples.
  • for each second text segment, a target entity is randomly selected from the remaining target entities and used together with the second text segment as a pair of dissimilar samples, so that multiple pairs of dissimilar samples can be constructed, wherein the remaining target entities are all of the target entities except the target entity of that second text segment. For example, by randomly replacing the above-mentioned "hydropower station" with another target entity, such as "wind power station" or "other project construction", multiple pairs of dissimilar samples can be constructed. Constructing such dissimilar samples lets the model learn that what matters is the entity within the sentence pattern, and that text paired with a different entity must be classified as a different product.
  • in this way, the model recognizes "this bond will be used to repay the previous hydropower station construction project loan" on the one hand and "wind power station" or "other project construction" on the other as products with different attributes, so that even among such similar candidates the most similar industry, namely hydropower stations, can be matched accurately, which improves entity matching and thus the recognition accuracy of the model.
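  • The entity-based pair construction above can be sketched as follows. The entity recognition step is assumed to have already produced one entity per text segment; the function name and the seeded random generator are illustrative choices, not part of the application.

```python
import random
from typing import List, Optional, Tuple

Pairs = List[Tuple[str, str]]


def build_entity_pairs(
    segments: List[str], entities: List[str], rng: Optional[random.Random] = None
) -> Tuple[Pairs, Pairs]:
    """Return (similar pairs, dissimilar pairs) from segments and their entities."""
    rng = rng or random.Random(0)
    # Similar pair: a segment and the entity extracted from it.
    similar = list(zip(segments, entities))
    dissimilar = []
    for i, segment in enumerate(segments):
        # Remaining entities: every entity except this segment's own.
        others = [e for j, e in enumerate(entities) if j != i and e != entities[i]]
        if others:
            dissimilar.append((segment, rng.choice(others)))
    return similar, dissimilar


segments = [
    "this bond will be used to repay the loan of the previous hydropower station construction project",
    "the proceeds will fund a wind power station",
]
entities = ["hydropower station", "wind power station"]
similar, dissimilar = build_entity_pairs(segments, entities)
print(similar[0][1], "|", dissimilar[0][1])
```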
  • FIG. 5 is a schematic flowchart of a method for identifying the proportion of green assets in a bond provided by an embodiment of the present application.
  • content in this embodiment that is the same as in the embodiments shown in FIG. 1, FIG. 3, and FIG. 4 is not described again here.
  • the method of the present embodiment comprises the following steps:
  • the first digital assets here are a part of the multiple first digital assets, that is, the disclosed bonds among the multiple first digital assets.
  • before determining the proportion of green assets in each first digital asset, it is possible to determine whether the first digital asset has green attributes as a whole. If it is determined that the first digital asset does not have green attributes, the proportion of green assets in the first digital asset can be directly determined to be 0; if it is determined that the first digital asset has green attributes, the proportion of green assets in the first digital asset is then determined.
  • the preset keyword set is a set of keywords that have green attributes and are related to bonds, that is, a set of keywords obtained by extracting keywords from the bond names of green bonds. For example, the preset keyword set may include "green bond", "carbon neutral", "energy efficient", etc. In other words, whether each bond has green attributes, that is, whether each bond is a green bond, is determined from the bond name.
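  • A minimal sketch of the bond-name check described above. The three keywords come from the example in the text; the case-insensitive substring match and the sample bond name "2021 Carbon Neutral Green Bond Phase I" are assumptions made for illustration.

```python
from typing import Iterable

# Keywords taken from the example in the text; a real set would be larger.
GREEN_KEYWORDS = ("green bond", "carbon neutral", "energy efficient")


def name_has_green_keyword(bond_name: str, keywords: Iterable[str] = GREEN_KEYWORDS) -> bool:
    """Return True if the bond name contains any keyword of the preset set."""
    name = bond_name.lower()
    return any(keyword in name for keyword in keywords)


print(name_has_green_keyword("2021 Carbon Neutral Green Bond Phase I"))
print(name_has_green_keyword(
    "Guangzhou Metro Group Co., Ltd. 2020 Phase II Super-short-term Financing Bond"
))
```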
  • alternatively, the company to which each first digital asset belongs is determined, that is, the issuing company of each bond is identified from the position data; then, the industry to which the company belongs is determined, for example, the industry of the company's main business product is taken as the industry of the company. Finally, it is determined whether this industry is an industry in a preset industry set, and if so, it is determined that the first digital asset has a green attribute, wherein the preset industry set is a set composed of industries with green attributes.
  • a preset document, such as the "Green Bond Support Project Catalogue", can be obtained, and entity extraction can then be performed on the preset document to obtain one or more green industries, such as public transportation and sewage treatment; these green industries are then combined into a set to obtain the preset industry set. In other words, whether the bond is a green bond is determined from the industry to which the bond belongs.
  • for example, if the disclosed data of the first digital asset shows that the bond is "Guangzhou Metro Group Co., Ltd. 2020 Phase II Super-short-term Financing Bond", it can be determined from the disclosed data that the issuing company of the bond is Guangzhou Metro Group Co., Ltd., and that the industry of the issuing company is public transportation. Since public transportation is an industry in the preset industry set, it is determined that the first digital asset has a green attribute.
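  • The industry-based check in the Guangzhou Metro example can be sketched as follows. The issuer-to-industry lookup table is a hypothetical stand-in for the entity recognition and industry determination steps, which the text does not specify in detail.

```python
from typing import Dict, Set

# Preset industry set, using the two industries named in the text.
GREEN_INDUSTRIES: Set[str] = {"public transportation", "sewage treatment"}

# Hypothetical lookup standing in for issuer recognition and industry mapping.
ISSUER_INDUSTRY: Dict[str, str] = {
    "Guangzhou Metro Group Co., Ltd.": "public transportation",
}


def bond_is_green_by_industry(
    bond_name: str,
    issuer_industry: Dict[str, str] = ISSUER_INDUSTRY,
    green_industries: Set[str] = GREEN_INDUSTRIES,
) -> bool:
    """Find the issuer in the bond name and test its industry against the preset set."""
    for issuer, industry in issuer_industry.items():
        if issuer in bond_name:
            return industry in green_industries
    return False  # issuer unknown: no green attribute can be inferred this way


bond = "Guangzhou Metro Group Co., Ltd. 2020 Phase II Super-short-term Financing Bond"
print(bond_is_green_by_industry(bond))
```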
  • alternatively, text recognition is performed on the disclosure data of each first digital asset, and a sixth text segment is identified from the disclosure data, wherein the sixth text segment is the text segment in the disclosure data that describes the multiple fund uses of the first digital asset. That is, through text positioning, the text describing each fund use of the bond is found in the disclosed data and extracted to obtain the sixth text segment; semantic information extraction is then performed on the sixth text segment to obtain a third feature vector of the sixth text segment; then, the probability that the first digital asset has a green attribute is predicted according to the third feature vector; if the probability is greater than a second threshold, it is determined that the first digital asset has a green attribute.
  • the above-mentioned way of determining whether the first digital asset has a green attribute can be realized through a trained model, which can be a fastText, TextCNN, or BERT model, etc.; this application does not limit it.
  • the text used to describe the use of funds is extracted from each bond sample and used as a sample, and a label is added to the sample, where the label identifies whether the bond sample has a green attribute.
  • bond samples with green attributes and bond samples with non-green attributes should both be selected, to ensure that the constructed samples contain positive samples and negative samples; then, model training is carried out based on the extracted samples and their labels to obtain a prediction model for predicting whether a bond has a green attribute; finally, the prediction model is used to extract semantic information from the sixth text segment to obtain the third feature vector of the sixth text segment, and the prediction model processes the third feature vector to predict the probability that the first digital asset has a green attribute.
  • the name of the bond or the industry to which the bond belongs can be given priority in determining whether the bond has green attributes.
  • if a first digital asset has a green attribute, the proportion of green assets in that first digital asset can then be identified.
  • a machine reading comprehension (Machine Reading Comprehension, MRC) model is trained in advance, and then the disclosure data of each first digital asset is input into the MRC model for text segmentation to obtain at least one first text segment.
  • first, the question to be answered by the MRC model is set to "which texts are used to describe the use of funds", and the input article is the disclosure data of each first digital asset; then, the question is encoded by the encoding layer of the MRC model to obtain a first vector, and each text segment in the disclosed data is encoded by the encoding layer of the MRC model to obtain a second vector corresponding to each text segment; then, the first vector and the second vector of each text segment are input into the interaction layer of the MRC model to obtain the similarity between the question and each text segment, and the text segments whose similarity is greater than a preset threshold are used as the at least one first text segment.
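  • The question-to-segment matching above can be illustrated with a toy stand-in. The sketch below replaces the MRC encoding layer with a bag-of-words count vector and the interaction layer with cosine similarity; both substitutions, the threshold value of 0.3 and the sample texts are assumptions, and a trained MRC model would behave differently.

```python
import math
from collections import Counter
from typing import List


def encode(text: str) -> Counter:
    """Toy encoder: lower-cased bag-of-words counts (stand-in for the encoding layer)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Toy interaction: cosine similarity between two count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_segments(question: str, segments: List[str], threshold: float = 0.3) -> List[str]:
    """Keep the segments whose similarity to the question exceeds the threshold."""
    q = encode(question)
    return [s for s in segments if cosine(q, encode(s)) > threshold]


question = "which texts describe the use of funds"
segments = [
    "the funds raised will be put to use to repay loans of the hydropower project",
    "the issuer was founded in 1992",
]
print(select_segments(question, segments))
```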
  • At least one first text segment as shown in Table 1 can be obtained.
  • the semantic information extraction model is pre-trained.
  • the training process of the semantic information extraction model is described below.
  • training samples are constructed first. For example, text segments related to the use of funds are extracted from the disclosure data of multiple bonds, and each text segment is labeled, where the label identifies whether the use of funds described in the text segment has a green attribute; the funds may be used for green industries or for non-green industries.
  • the initial model may be a BERT model that comprises a semantic information extraction model and a multilayer perceptron (MLP), where the model parameters of the semantic information extraction model and of the multilayer perceptron are obtained by random initialization. The training samples are input into the semantic information extraction model for semantic information extraction to obtain a fourth feature vector of each training sample; the fourth feature vector is input into the multilayer perceptron to obtain the probability that the training sample belongs to an industry with green attributes. Finally, the initial model is trained according to this probability and the label of the training sample, that is, the model parameters of the semantic information extraction model and of the multilayer perceptron are adjusted to obtain a target model, and the multilayer perceptron in the target model is deleted to obtain the semantic information extraction model.
  • each first text segment may be input into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment.
  • alternatively, the multilayer perceptron in the target model may be kept rather than deleted, so that the entire target model is retained; each fifth text segment is then input into the target model for probability prediction to obtain the probability that the fund use described in each fifth text segment belongs to a green industry. If the probability is greater than the probability threshold, the fifth text segment is determined to be a target fifth text segment, so that the target first text segment can be determined directly without similarity calculation, which improves the efficiency of identifying the proportion of green assets.
  • multiple industries with green attributes ie green industries
  • the entity is an industry
  • the multiple industries are regarded as the multiple primary industries
  • the user information is extracted from the PDF document.
  • the extraction model performs semantic information extraction to obtain a second feature vector of each second text segment.
  • the similarity between the first feature vector of each first text segment and the second feature vector of each second text segment can be determined; for example, the similarity can be represented by the Euclidean distance between the two feature vectors. The similarity between the two feature vectors is used as the similarity between each first text segment and each second text segment.
  • the maximum similarity corresponding to each first text segment is determined, and if the maximum similarity is greater than a threshold, the first text segment is taken as the target first text segment. Specifically, if the maximum similarity is greater than the threshold, the industry to which the fund use described in the first text segment belongs is the first industry described in the second text segment corresponding to the maximum similarity; that is, the industry supported by the fund use is a green industry, so it can be determined that the use of funds has green attributes.
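  • A minimal sketch of the maximum-similarity selection above. The text mentions Euclidean distance but compares the similarity against a threshold with "greater than", so the sketch assumes the distance is mapped to a similarity score 1 / (1 + d); that mapping, the threshold of 0.5 and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))


def find_target_segments(
    first_vecs: List[Sequence[float]],
    second_vecs: List[Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Return (first index, best-matching second index, similarity) above threshold."""
    targets = []
    for i, u in enumerate(first_vecs):
        best_j, best_sim = max(
            ((j, similarity(u, v)) for j, v in enumerate(second_vecs)),
            key=lambda item: item[1],
        )
        if best_sim > threshold:
            targets.append((i, best_j, best_sim))
    return targets


first_vectors = [[0.0, 0.0], [10.0, 10.0]]   # fund-use segments of one bond
second_vectors = [[0.0, 0.5], [5.0, 5.0]]    # green-industry reference segments
print([(i, j) for i, j, _ in find_target_segments(first_vectors, second_vectors)])
```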
  • 506: Use the ratio of the fund amount planned in the fund use described in the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
  • there may be one or more target first text segments, that is, the industries to which several of the fund uses of each first digital asset belong may have green attributes. In that case, the ratio of the fund amount planned in the fund use described in each target first text segment to the total amount of the first digital asset is used as the green ratio corresponding to that target first text segment; the green ratios of all target first text segments are then summed to obtain the proportion of green assets in the first digital asset.
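  • The calculation in step 506 can be sketched as follows. The amounts and the list of target indices are made-up illustration values, and the function name is an assumption.

```python
from typing import List, Sequence


def green_proportion(
    planned_amounts: Sequence[float], total_amount: float, target_indices: List[int]
) -> float:
    """Sum the per-use green ratios of the fund uses flagged as green."""
    if total_amount <= 0:
        raise ValueError("total_amount must be positive")
    # Green ratio of one fund use = planned amount / total amount of the bond.
    return sum(planned_amounts[i] / total_amount for i in target_indices)


# Three fund uses of one bond; uses 0 and 2 were matched to green industries.
amounts = [300.0, 500.0, 200.0]
print(green_proportion(amounts, total_amount=1000.0, target_indices=[0, 2]))
```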
  • FIG. 6 is a block diagram of functional units of a device for identifying the proportion of green assets provided by an embodiment of the present application.
  • the device 600 for identifying the proportion of green assets includes: an acquisition unit 601 and a processing unit 602;
  • An acquisition unit 601, configured to acquire position data of digital assets to be identified
  • the processing unit 602 is configured to perform text recognition on the acquired position data of the digital assets to be identified to obtain a plurality of first digital assets and second digital assets, wherein the asset information of each of the first digital assets is disclosed in the position data and the asset information of the second digital asset is not disclosed in the position data;
  • the obtaining unit 601 is further configured to obtain the disclosure data of each of the first digital assets according to the asset information of each of the first digital assets;
  • the processing unit 602 is further configured to input the disclosure data of each of the first digital assets into the machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment is used to describe each asset distribution of the first digital asset;
  • the similarity model determine the similarity between each of the first text segments and a plurality of second text segments, wherein the plurality of second text segments are used to describe a plurality of capital distributions with green attributes;
  • the portrait of the manager of the digital asset to be identified obtain all the digital assets managed by the manager, and obtain the average proportion of green assets among the digital assets whose asset information is disclosed among all the digital assets, and The average proportion is taken as the proportion of green assets in the second digital asset;
  • the asset distribution of each of the first digital assets is as follows: The proportion of the sub-products of the enterprise to which the digital asset belongs, the distribution of funds described in each of the second text paragraphs is a product with green attributes; after inputting the disclosed data of each of the first digital assets into the machine reading comprehension model for text
  • the processing unit 602 is specifically used for:
  • Target chapters are used to describe the main products of the companies to which each of the first digital assets belongs, and the target chapters include target tables and the target text segment;
  • each of the first text segments is used to describe a sub-product of the main product
  • the processing unit 602 is specifically configured to:
  • Entity identification is performed on both the target text segment and the target table to obtain the proportion of the main product, where the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the affiliated enterprise
  • the acquiring unit 601 before determining the similarity between each of the first text segments and multiple second text segments according to the similarity model, is further configured to acquire the first preset document, The products recorded in the first preset document all have green attributes;
  • the processing unit 602 is further configured to perform text recognition on the first preset document to obtain multiple third text segments, wherein the multiple third text segments are used to describe the product;
  • any third text segment in the plurality of third text segments refers to other documents, perform text recognition on the other documents to obtain a fourth text segment corresponding to any one of the third text segments, wherein, The fourth text segment is the text used to describe products with green attributes in the other documents;
  • the initial model is trained according to the multiple pairs of target training samples to obtain the similarity model.
  • the processing unit 602 is specifically used for:
  • the first feature vector of each of the first text segments and the second feature vector of each of the second text segments determine the similarity between each of the first text segments and a plurality of second text segments
  • the processing unit 602 is specifically configured to:
  • the ratio of the planned fund amount in the fund use described in the first text paragraph of the target to the total amount of each of the first digital assets is taken as the proportion of green assets in each of the first digital assets.
  • the proportion of green assets in the digital assets to be identified is determined.
  • the processing unit 602 is specifically used for:
  • the first proportion of each of the first digital assets and the proportion of green assets determine the first proportion of the green assets of each of the first digital assets relative to the net value of the digital assets to be identified;
  • the second proportion of the second digital asset and the proportion of the green asset determine the second proportion of the green asset of the second digital asset relative to the net value of the digital asset to be identified
  • before summing the first proportion of each of the first digital assets and the second proportion of the second digital asset, the processing unit 602 is further configured to perform text recognition on the position data to obtain the total amount of some of the first digital assets among the plurality of first digital assets, the total amount of the second digital assets, and the total amount of the digital assets to be identified;
  • deleveraging is performed on the first ratio of the part of the first digital asset and the second ratio of the second digital asset to obtain the first target ratio of the part of the first digital asset and the second target ratio of the second digital asset;
  • in terms of summing the first proportion of each of the first digital assets and the second proportion of the second digital asset to obtain the proportion of green assets in the digital assets to be identified, the processing unit 602 is specifically configured to:
  • sum the first proportion of another part of the first digital assets in the plurality of first digital assets, the first target proportion of the part of the first digital assets, and the second target proportion of the second digital assets to obtain the proportion of green assets in the digital assets to be identified.
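  • The final aggregation can be sketched as follows. The sketch assumes that the green share of each holding relative to the net value is its position weight multiplied by its proportion of green assets, and it omits the deleveraging step because its formula is not given in this part of the text; the weights and proportions are made-up illustration values.

```python
from typing import List, Tuple


def portfolio_green_proportion(
    first_assets: List[Tuple[float, float]],
    second_weight: float,
    second_green: float,
) -> float:
    """Sum the green shares of all holdings relative to the net value.

    first_assets: (position weight, green proportion) per disclosed asset.
    second_weight / second_green: same two values for the undisclosed holdings.
    """
    # Assumed rule: green share of a holding = position weight * green proportion.
    first_total = sum(weight * green for weight, green in first_assets)
    return first_total + second_weight * second_green


holdings = [(0.4, 0.5), (0.3, 0.0)]  # two disclosed bonds
print(portfolio_green_proportion(holdings, second_weight=0.3, second_green=0.2))
```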
  • FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • an electronic device 700 includes a transceiver 701 , a processor 702 and a memory 703 . They are connected through a bus 704 .
  • the memory 703 is used to store computer programs and data, and can transmit the data stored in the memory 703 to the processor 702 .
  • the processor 702 is used to read the computer program in the memory 703 to perform the following operations:
  • the similarity model determine the similarity between each of the first text segments and a plurality of second text segments, wherein the plurality of second text segments are used to describe a plurality of capital distributions with green attributes;
  • the portrait of the manager of the digital asset to be identified obtain all the digital assets managed by the manager, and obtain the average proportion of green assets among the digital assets whose asset information is disclosed among all the digital assets, and The average proportion is taken as the proportion of green assets in the second digital asset;
  • the asset distribution of each of the first digital assets is as follows: The proportion of the sub-products of the enterprise to which the digital asset belongs, the distribution of funds described in each of the second text paragraphs is a product with green attributes; after inputting the disclosed data of each of the first digital assets into the machine reading comprehension model for text
  • the processor 702 is specifically configured to perform the following operations:
  • Target chapters are used to describe the main products of the companies to which each of the first digital assets belongs, and the target chapters include target tables and the target text segment;
  • each of the first text segments is used to describe a sub-product of the main product
  • in terms of determining the proportion of green assets in each of the first digital assets according to the asset distribution described in the target first text paragraph and the total amount of each of the first digital assets, the processor 702 is specifically configured to perform the following operations:
  • Entity identification is performed on both the target text segment and the target table to obtain the proportion of the main product, where the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the affiliated enterprise
  • the processor 702 before determining the similarity between each of the first text segments and multiple second text segments, the processor 702 is further configured to perform the following operations:
  • any third text segment in the plurality of third text segments refers to other documents, perform text recognition on the other documents to obtain a fourth text segment corresponding to any one of the third text segments, wherein, The fourth text segment is the text used to describe products with green attributes in the other documents;
  • the initial model is trained according to the multiple pairs of target training samples to obtain the similarity model.
  • the processor 702 is specifically configured to perform the following operations:
  • the first feature vector of each of the first text segments and the second feature vector of each of the second text segments determine the similarity between each of the first text segments and a plurality of second text segments
  • the processor 702 is specifically configured to perform the following operations:
  • the ratio of the planned fund amount in the fund use described in the first text paragraph of the target to the total amount of each of the first digital assets is taken as the proportion of green assets in each of the first digital assets.
  • the processor 702 is specifically configured to perform the following operations:
  • the first proportion of each of the first digital assets and the proportion of green assets determine the first proportion of the green assets of each of the first digital assets relative to the net value of the digital assets to be identified;
  • the second proportion of the second digital asset and the proportion of the green asset determine the second proportion of the green asset of the second digital asset relative to the net value of the digital asset to be identified
  • the processor 702 before summing the first percentages of the first digital assets and the second percentages of the second digital assets, the processor 702 is further configured to perform the following operations:
  • deleveraging is performed on the first ratio of the part of the first digital asset and the second ratio of the second digital asset to obtain the first target ratio of the part of the first digital asset and the second target ratio of the second digital asset;
  • the processor 702 is specifically configured to perform the following operations:
  • sum the first proportion of another part of the first digital assets in the plurality of first digital assets, the first target proportion of the part of the first digital assets, and the second target proportion of the second digital assets to obtain the proportion of green assets in the digital assets to be identified.
  • the above-mentioned transceiver 701 may be the acquisition unit 601 of the green ratio recognition device 600 of the embodiment shown in FIG. 6, and the above-mentioned processor 702 may be the processing unit 602 of the green ratio recognition device 600 of the embodiment shown in FIG. 6 .
  • the electronic devices in this application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, palmtop computers, notebook computers, mobile Internet devices (MID), wearable devices, etc.
  • the above-mentioned electronic devices are only examples, not exhaustive, including but not limited to the above-mentioned electronic devices. In practical applications, the above-mentioned electronic devices may also include: smart vehicle-mounted terminals, computer equipment, and the like.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement part or all of the steps of any method for identifying the proportion of green assets in digital assets based on text recognition described in the above method embodiments.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the embodiment of the present application also provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to enable the computer to execute the method described in the above method embodiments Part or all of the steps of any method for identifying the proportion of green assets in digital assets based on text recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Development Economics (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and in particular, to a method for recognizing a proportion of green assets in digital assets and a related product. The method comprises: performing text recognition on obtained position holding data of digital assets to be recognized to obtain a plurality of first digital assets and second digital assets; obtaining at least one first text segment according to asset information of the first digital assets; determining a similarity between the first text segment and a plurality of second text segments; determining a target first text segment according to the similarity between the first text segment and the plurality of second text segments; determining a proportion of green assets in the first digital assets according to an asset distribution described by the target first text segment; and determining, according to the proportion of the green assets in the first digital assets and a proportion of green assets in the second digital assets, a proportion of green assets in the digital assets to be recognized.

Description

Method for identifying the proportion of green assets in digital assets and related products
Priority statement
This application claims priority to Chinese patent application No. 202111280770.2, filed with the China Patent Office on October 30, 2021 and entitled "Method for Identifying the Proportion of Green Assets in Digital Assets and Related Products", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a method for identifying the proportion of green assets in digital assets and related products.
Background
Against the backdrop of global cooperation on climate change, each regulatory authority needs to determine the scale of green and non-green assets within its jurisdiction so that it can plan the path to carbon peaking and carbon neutrality more scientifically.
Investment institutions play a very important role in achieving carbon peaking and carbon neutrality: their choice of investment targets in effect steers enterprises toward green industries and carbon-neutrality compliance.
The inventors realized that when investment institutions compile their green investment ratios, regulatory and confidentiality requirements prevent data from being shared across departments, so each department compiles the statistics manually, which is highly subjective and imprecise.
Summary of the invention
The embodiments of the present application provide a method for identifying the proportion of green assets in digital assets and related products, so as to improve the accuracy with which that proportion is identified.
In a first aspect, an embodiment of the present application provides a method for identifying the proportion of green assets in digital assets based on text recognition, including: performing text recognition on acquired position data of digital assets to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset; acquiring disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset; determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions having a green attribute; determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments; determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset; acquiring, according to a profile of the manager of the digital assets to be identified, all digital assets managed by the manager, obtaining the average proportion of green assets in those managed digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset; and determining the proportion of green assets in the digital assets to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In a second aspect, an embodiment of the present application provides an apparatus for identifying the proportion of green assets, including an acquisition unit and a processing unit. The acquisition unit is configured to acquire position data of digital assets to be identified. The processing unit is configured to perform text recognition on the acquired position data to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset. The acquisition unit is further configured to acquire disclosure data of each first digital asset according to the asset information of that first digital asset. The processing unit is further configured to: input the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset; determine, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions having a green attribute; determine a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments; determine the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset; acquire, according to a profile of the manager of the digital assets to be identified, all digital assets managed by the manager, obtain the average proportion of green assets in those managed digital assets whose asset information is disclosed, and take the average proportion as the proportion of green assets in the second digital asset; and determine the proportion of green assets in the digital assets to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In a third aspect, an embodiment of the present application provides an electronic device including a processor and a memory, wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory so that the electronic device performs the following steps:
performing text recognition on acquired position data of digital assets to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
acquiring disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions having a green attribute;
determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
acquiring, according to a profile of the manager of the digital assets to be identified, all digital assets managed by the manager, obtaining the average proportion of green assets in those managed digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset;
determining the proportion of green assets in the digital assets to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program is executed by a processor so that a computer performs the following steps:
performing text recognition on acquired position data of digital assets to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
acquiring disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions having a green attribute;
determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
acquiring, according to a profile of the manager of the digital assets to be identified, all digital assets managed by the manager, obtaining the average proportion of green assets in those managed digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset;
determining the proportion of green assets in the digital assets to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In a fifth aspect, an embodiment of the present application provides a computer program product that includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the method described in the first aspect.
Implementing the embodiments of the present application has the following beneficial effects:
In the embodiments of the present application, the position data of the digital assets to be identified is acquired and split into the first digital assets and the second digital asset. Text recognition technology and machine models then automatically identify the proportion of green assets in the first digital assets and in the second digital asset, and from those proportions the proportion of green assets in the digital assets to be identified is identified automatically. The proportion of green assets in the digital assets to be identified (a fund) no longer has to be tallied manually, which saves labor cost, avoids the subjectivity of manual statistics, and improves the accuracy with which the proportion of green assets in the fund is identified.
Description of drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for identifying the proportion of green assets in digital assets based on text recognition, provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the position data of a fund, provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for identifying the proportion of green assets in a stock, provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a similarity model training method, provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for identifying the proportion of green assets in a bond, provided by an embodiment of the present application;
FIG. 6 is a block diagram of the functional units of an apparatus for identifying the proportion of green assets, provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device, provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
The terms "first", "second", "third", "fourth", and the like in the specification, the claims, and the drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units; it optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation and interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
First, note that the application scenario of this application is identifying the green assets in a fund, so the digital assets to be identified are a fund to be identified. A fund generally consists of multiple stocks, multiple bonds, and other fixed-income assets. When positions are disclosed, the asset information of every stock is fully disclosed, for example the stock's name, its share of the fund, and its net value. For bonds, however, not all asset information is made public. Some bonds disclose their name, share, net value, and so on; this application calls them disclosed bonds. Other bonds disclose no information, such as share or net value, and are reported together as "other bonds"; this application refers to them collectively as undisclosed bonds and treats them as a single whole without further subdivision. Other fixed-income assets usually consist of items such as bank deposits, which have no green component or whose green component cannot be judged, so this part is not counted. This application therefore starts from the stocks and bonds in the fund to identify the proportion of green assets in the fund.
For ease of description, this application refers to the digital assets whose asset information is disclosed among the digital assets to be identified (multiple stocks and multiple disclosed bonds) as the plurality of first digital assets, and to the digital assets whose asset information is not disclosed, namely the undisclosed bonds, as the second digital asset.
The following describes, with reference to the drawings, how to obtain the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
Referring to FIG. 1, FIG. 1 shows a method for identifying the proportion of green assets in digital assets based on text recognition, provided by an embodiment of the present application. The method is applied to an apparatus for identifying the proportion of green assets and includes the following steps:
101: Acquire the position data of the digital assets to be identified.
For example, the position data is acquired by crawler technology from the platform of the company that issues the digital assets to be identified, or from a third-party platform that manages them.
102: Perform text recognition on the position data to obtain a plurality of first digital assets and a second digital asset.
For example, as shown in FIG. 2, text recognition is performed on the position data to find the keyword "stock name", and text recognition is then performed on each element under the stock name to obtain some of the first digital assets, such as the stocks "China CDFG" (中国中免) and "Wuliangye" (五粮液) shown in FIG. 2. Likewise, the position data is recognized to find the keyword "bond name", and text recognition is then performed on each element under the bond name to obtain the remaining first digital assets, such as the bonds "20 Agricultural Development 09" (20农发09) and "21 National Development 01" (21国开01) shown in FIG. 2. Note that because the position data does not disclose the asset information of the second digital asset, the position data cannot reveal what these assets actually are. This application refers to the undisclosed bonds collectively as the second digital asset and assumes by default that the digital assets to be identified contain it, so no text recognition of the position data is needed to find it.
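As a rough illustration of step 102, the sketch below groups recognized asset names by the header keyword they appear under. It assumes the text recognition output has already been arranged into rows of cells and that the header keywords read "stock name" and "bond name". The row layout, the function name `split_holdings`, and the percentage values are hypothetical and are not taken from the application.

```python
from typing import Dict, List

# Assumed header keywords; the real position data may label its columns differently.
STOCK_HEADER = "stock name"
BOND_HEADER = "bond name"


def split_holdings(rows: List[List[str]]) -> Dict[str, List[str]]:
    """Group asset names under the most recent 'stock name' or 'bond name' header."""
    groups: Dict[str, List[str]] = {"stocks": [], "disclosed_bonds": []}
    current = None  # section currently being read, if any
    for row in rows:
        if not row or not row[0].strip():
            continue  # skip empty rows
        first_cell = row[0].strip()
        key = first_cell.lower()
        if key == STOCK_HEADER:
            current = "stocks"
        elif key == BOND_HEADER:
            current = "disclosed_bonds"
        elif current is not None:
            groups[current].append(first_cell)
    return groups


# Made-up rows for illustration only; the shares are not real holdings data.
rows = [
    ["Stock name", "Share of net value"],
    ["China CDFG", "9.8%"],
    ["Wuliangye", "8.1%"],
    ["Bond name", "Share of net value"],
    ["20 Agricultural Development 09", "3.2%"],
    ["21 National Development 01", "2.5%"],
]
holdings = split_holdings(rows)
```

The undisclosed bonds never appear in `holdings`, which matches the text: the second digital asset is assumed to exist by default rather than recognized from the position data.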
103: Acquire the disclosure data of each first digital asset according to the asset information of that first digital asset, and input the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset.
For example, the asset name of each first digital asset is obtained from the position data, and the disclosure data of each first digital asset is then acquired by crawler technology based on that asset name.
Optionally, when the first digital asset is a stock, its disclosure data is the stock's disclosure document, that is, the annual report published by the enterprise to which the stock belongs, and the fund distribution described by a first text segment is the share of one of that enterprise's sub-products. The enterprise's annual report is segmented by the machine reading comprehension model to obtain the at least one first text segment. How the annual report is segmented and how the proportion of green assets in a stock is obtained are described in detail later and are not elaborated here.
Optionally, when the first digital asset is a bond (a disclosed bond), its disclosure data is the bond's disclosure data, that is, the data disclosed by the issuing company when the bond was raised, and the fund distribution described by a first text segment is the bond's use of funds. The bond's disclosure data is therefore segmented by the machine reading comprehension model to obtain the at least one first text segment. How the disclosure data is segmented and how the proportion of green assets in a disclosed bond is obtained are described in detail later and are not elaborated here.
104: Determine, according to a similarity model, the similarity between each first text segment and each of multiple second text segments, where the multiple second text segments describe multiple fund distributions having a green attribute.
Exemplarily, when the proportion of green assets in a stock is calculated, the multiple fund distributions described by the multiple second text segments are multiple industries having a green attribute, referred to as multiple first industries.
Exemplarily, when the proportion of green assets in a disclosed bond is calculated, the multiple fund distributions described by the multiple second text segments are multiple uses of funds having a green attribute.
105: Determine a target first text segment among the at least one first text segment according to the similarities between each first text segment and the multiple second text segments.
Exemplarily, for each first text segment, the maximum of its similarities to the multiple second text segments is determined; if the maximum similarity is greater than a preset threshold, that first text segment is taken as a target first text segment.
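The selection rule of step 105 can be illustrated with a minimal sketch. The function name `select_target_segments`, the dictionary layout, and the sample scores are assumptions made for illustration; the similarity values would in practice come from the similarity model of step 104.

```python
from typing import Dict, List, Tuple


def select_target_segments(
    similarities: Dict[str, List[float]],
    threshold: float,
) -> List[Tuple[str, int, float]]:
    """Return (first_segment, index_of_best_second_segment, max_similarity)
    for every first text segment whose maximum similarity exceeds the threshold."""
    targets = []
    for segment, scores in similarities.items():
        if not scores:
            continue  # nothing to compare this segment against
        best_index = max(range(len(scores)), key=scores.__getitem__)
        best_score = scores[best_index]
        if best_score > threshold:  # strictly greater, as in the text
            targets.append((segment, best_index, best_score))
    return targets


# Hypothetical scores of two first text segments against three second text segments.
example = {
    "lithium battery": [0.91, 0.20, 0.05],
    "office furniture": [0.10, 0.15, 0.12],
}
selected = select_target_segments(example, threshold=0.8)
```

Keeping the index of the best-matching second text segment makes it possible to report which green industry or green use of funds each target first text segment was matched to.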
106: Determine the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset.
Exemplarily, when the first digital asset is a stock, the proportion of the sub-product described by the target first text segment is taken as the proportion of green assets in the first digital asset. When the first digital asset is a bond, the ratio of the funds planned for the use of funds described by the target first text segment to the total amount of the first digital asset is taken as the proportion of green assets in the first digital asset.
107: Obtain, according to the portrait of the manager of the digital asset to be identified, all digital assets managed by the manager; obtain the average proportion of green assets in those of the digital assets that have disclosed asset information; and take the average proportion as the proportion of green assets in the second digital asset.
Exemplarily, all digital assets managed by the manager are obtained according to the portrait of the manager of the digital asset to be identified (in this application, the manager can be understood as a fund manager); the average proportion of green assets in the digital assets that have disclosure information among all these digital assets is obtained, and the average proportion is taken as the proportion of green assets in the second digital asset.
Specifically, all funds managed by the fund manager are obtained. Then, for any disclosed bond in any fund managed by the fund manager, the proportion of green assets in that disclosed bond is obtained in the manner described above, and the proportions of green assets in all disclosed bonds in that fund are summed to obtain the bond-related proportion of that fund. Finally, the bond-related proportions of all managed funds are averaged to obtain the average proportion.
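The averaging described above can be sketched as follows. This is only an illustration: the function name `manager_average_green_share` and the input layout (fund name mapped to the green proportions of its disclosed bonds) are assumptions, and the per-bond proportions are taken as already computed by steps 103 to 106.

```python
from statistics import mean
from typing import Dict, List


def manager_average_green_share(funds: Dict[str, List[float]]) -> float:
    """Average, over all funds of one manager, of each fund's bond-related
    proportion (the sum of the green proportions of its disclosed bonds)."""
    if not funds:
        raise ValueError("the manager has no funds with disclosed bonds")
    bond_related = [sum(proportions) for proportions in funds.values()]
    return mean(bond_related)


# Hypothetical manager with two funds.
average = manager_average_green_share({
    "fund_a": [0.10, 0.20],  # bond-related proportion 0.30
    "fund_b": [0.05],        # bond-related proportion 0.05
})
```

The resulting value is what step 107 uses as the proportion of green assets in the undisclosed second digital asset.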
108: Determine the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
Exemplarily, a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified is obtained; as shown in FIG. 2, text recognition can be performed on the position data to obtain the first ratio. Then, according to the first ratio of each first digital asset and its proportion of green assets, a first proportion of the green assets in each first digital asset relative to the net value of the digital asset to be identified is determined.
Exemplarily, the first proportion of each first digital asset can be expressed by formula (1):
[Formula (1): Figure PCTCN2022090224-appb-000001]
where [Figure PCTCN2022090224-appb-000002] is the first proportion of the i-th first digital asset among the multiple first digital assets, [Figure PCTCN2022090224-appb-000003] is the first ratio of the i-th first digital asset, and [Figure PCTCN2022090224-appb-000004] is the proportion of green assets in the i-th first digital asset.
Exemplarily, because the asset information of the second digital asset is not disclosed, the second ratio of the net value of the second digital asset to the net value of the digital asset to be identified cannot be read directly from the position data. However, the position data does disclose the total ratio of each first digital asset relative to the net value of the digital asset to be identified. The second ratio of the net value of the second digital asset to the net value of the digital asset to be identified can therefore be determined from the position data and the first ratio of each first digital asset.
Exemplarily, the second ratio of the second digital asset can be expressed by formula (2):
[Formula (2): Figure PCTCN2022090224-appb-000005]
where $HP_{b2}$ is the second ratio of the net value of the second digital asset to the net value of the digital asset to be identified, [Figure PCTCN2022090224-appb-000006] is the ratio of the i-th first digital asset to the net value of the digital asset to be identified, and m is the number of first digital assets.
Further, a second proportion of the green assets in the second digital asset relative to the net value of the digital asset to be identified is determined according to the second ratio of the second digital asset and its proportion of green assets.
Exemplarily, the second proportion of the second digital asset can be expressed by formula (3):
[Formula (3): Figure PCTCN2022090224-appb-000007]
where $FG_{b2}$ is the second proportion of the second digital asset, and [Figure PCTCN2022090224-appb-000008] is the proportion of green assets in the second digital asset.
Exemplarily, the first proportion of each first digital asset and the second proportion of the second digital asset are summed to obtain the proportion of green assets in the digital asset to be identified.
Exemplarily, the proportion of green assets in the digital asset to be identified can be expressed by formula (4):
[Formula (4): Figure PCTCN2022090224-appb-000009]
where $FG$ is the proportion of green assets in the digital asset to be identified.
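The chain from formula (1) to formula (4) can be sketched in code. Because the formula images are not reproduced in the text, this is a reading of the surrounding definitions rather than the formulas themselves. In particular, the form of formula (2) used here (a disclosed total ratio minus the sum of the first ratios) is an assumption, and the names `fund_green_proportion` and `total_ratio` are hypothetical.

```python
from typing import Sequence


def fund_green_proportion(
    first_ratios: Sequence[float],   # first ratio of each first digital asset
    first_green: Sequence[float],    # green proportion of each first digital asset
    total_ratio: float,              # total ratio read from the position data (assumed input)
    second_green: float,             # green proportion of the second digital asset (step 107)
) -> float:
    if len(first_ratios) != len(first_green):
        raise ValueError("one green proportion is needed per first digital asset")
    # Formula (1): first proportion = first ratio * green proportion.
    first_proportions = [hp * g for hp, g in zip(first_ratios, first_green)]
    # Formula (2), ASSUMED form: whatever part of the total ratio is not
    # covered by the first digital assets belongs to the second digital asset.
    second_ratio = total_ratio - sum(first_ratios)
    # Formula (3): second proportion = second ratio * green proportion.
    second_proportion = second_ratio * second_green
    # Formula (4): sum of all first proportions and the second proportion.
    return sum(first_proportions) + second_proportion


fg = fund_green_proportion(
    first_ratios=[0.2, 0.1],
    first_green=[0.5, 0.0],
    total_ratio=0.5,
    second_green=0.25,
)
```

With these hypothetical inputs the first digital assets contribute 0.10, the second digital asset contributes 0.2 × 0.25 = 0.05, and the fund-level proportion is about 0.15.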
In one embodiment of the present application, before the proportion of green assets in the digital asset to be identified is determined, text recognition may also be performed on the position data to obtain the total amount of a part of the multiple first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified, where the part of the first digital assets consists of the disclosed bonds among the multiple first digital assets. Text recognition is likewise performed on the position data to obtain the total net value of the part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified. A third ratio, $Pb_v$, is determined as the sum of the total amount of the part of the first digital assets and the total amount of the second digital asset relative to the total amount of the digital asset to be identified. A fourth ratio, $Pb_{npv}$, is determined as the sum of the total net value of the part of the first digital assets and the total net value of the second digital asset relative to the total net value of the digital asset to be identified. A leverage ratio is determined according to the third ratio and the fourth ratio, that is, the leverage ratio of the bonds (including disclosed and undisclosed bonds) is determined; exemplarily, the leverage ratio is $Pb_v / Pb_{npv}$.
The leverage ratio is calculated because, when the proportion of green assets in bonds is counted, the bond assets used are leveraged bond assets, which makes the counted proportion too high; the effect of leverage therefore needs to be removed. Accordingly, the first proportions of the part of the first digital assets and the second proportion of the second digital asset are each deleveraged according to the leverage ratio, to obtain first target proportions of the part of the first digital assets and a second target proportion of the second digital asset. Finally, the first proportions of the other part of the first digital assets (that is, the stocks among the multiple first digital assets), the first target proportions of the part of the first digital assets, and the second target proportion of the second digital asset are summed to obtain the proportion of green assets in the digital asset to be identified.
Exemplarily, the green proportion of the digital asset to be identified can be expressed by formula (5):
[Formula (5): Figure PCTCN2022090224-appb-000010]
where $m_1$ is the number of first digital assets in the other part (the stocks), $m_2$ is the number of first digital assets in the part (the disclosed bonds), and $m_1 + m_2 = m$.
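A minimal sketch of the deleveraged sum follows. The text states that the leverage ratio is $Pb_v / Pb_{npv}$ but does not reproduce formula (5), so the way deleveraging is applied here (dividing each bond-related proportion by the leverage ratio) is an assumption, as are the function and parameter names.

```python
from typing import Sequence


def deleveraged_green_proportion(
    stock_proportions: Sequence[float],  # first proportions of the m_1 stocks
    bond_proportions: Sequence[float],   # first proportions of the m_2 disclosed bonds
    second_proportion: float,            # second proportion of the undisclosed bonds
    pb_v: float,                         # third ratio (by total amount)
    pb_npv: float,                       # fourth ratio (by total net value)
) -> float:
    if pb_v <= 0 or pb_npv <= 0:
        raise ValueError("both ratios must be positive")
    leverage = pb_v / pb_npv
    # ASSUMPTION: deleveraging divides a bond-related proportion by the leverage ratio.
    bond_targets = [p / leverage for p in bond_proportions]
    second_target = second_proportion / leverage
    # Stocks are not leveraged, so their first proportions are used unchanged.
    return sum(stock_proportions) + sum(bond_targets) + second_target


fg_deleveraged = deleveraged_green_proportion(
    stock_proportions=[0.04],
    bond_proportions=[0.06, 0.03],
    second_proportion=0.03,
    pb_v=1.2,
    pb_npv=1.0,
)
```

With a hypothetical leverage ratio of 1.2, the bond-related proportions shrink from 0.12 in total to 0.10, giving a fund-level proportion of about 0.14.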
It can be seen that, in the embodiments of the present application, the position data of the digital asset to be identified is obtained and the first digital assets and the second digital asset are separated out based on the position data; the proportions of green assets in the first digital assets and in the second digital asset are then identified automatically based on text recognition technology and machine models; and finally the proportion of green assets in the digital asset to be identified is identified automatically based on those proportions. The proportion of green assets in the digital asset to be identified (a fund) does not need to be counted manually, which saves labor cost, avoids the subjectivity introduced by manual counting, and improves the accuracy with which the proportion of green assets in a fund is identified.
In one embodiment of the present application, the digital asset to be identified is any one of multiple digital assets to be identified held by an investment institution at time t, that is, any one of multiple funds held by the investment institution. Optionally, the proportion of green assets in each of the multiple digital assets to be identified may be determined based on the method shown in FIG. 1.
Exemplarily, the net value of each digital asset to be identified at time t and the share of each digital asset to be identified held by the investment institution are obtained; the green scale of each digital asset to be identified held by the investment institution is then determined according to that net value, that share, and the proportion of green assets in each digital asset to be identified.
Exemplarily, the green scale of each digital asset to be identified held by the investment institution can be expressed by formula (6):
$S_i = FG_i \times V_i \times R_i$    Formula (6);
where $S_i$ is the green scale of the i-th digital asset to be identified among the multiple digital assets to be identified held by the investment institution, $FG_i$ is the proportion of green assets in the i-th digital asset to be identified, $V_i$ is the net value of the i-th digital asset to be identified at time t, and $R_i$ is the share of the i-th digital asset to be identified held by the investment institution at time t.
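Formula (6) translates directly into code. The sketch below is illustrative only; the `Holding` class and its field names are hypothetical, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    green_proportion: float  # FG_i, proportion of green assets in fund i
    net_value: float         # V_i, net value of fund i at time t
    shares: float            # R_i, shares of fund i held by the institution at time t


def green_scale(holding: Holding) -> float:
    """Formula (6): S_i = FG_i * V_i * R_i."""
    return holding.green_proportion * holding.net_value * holding.shares


# Hypothetical holding: 15% green, net value 1.2 per share, 1000 shares.
scale = green_scale(Holding(green_proportion=0.15, net_value=1.2, shares=1000))
```

For this hypothetical holding the green scale is about 180, expressed in the same currency unit as the net value.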
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a method for identifying the proportion of green assets in a stock provided by an embodiment of the present application. Content of this embodiment that is the same as in the embodiment shown in FIG. 1 is not described again here. The method of this embodiment includes the following steps:
301: Perform text recognition on the disclosure document of each first digital asset to obtain a target section of the disclosure document, where the target section describes the main products of the enterprise to which each first digital asset belongs and includes a target table and a target text segment.
The disclosure document is the annual report of the company that issued the first digital asset. Generally, the subsection "I. Overview" in the section "Section IV Discussion and Analysis of Operations" of a company's annual report describes the company's main products. Therefore, text recognition is performed on the disclosure document to locate the section "Section IV Discussion and Analysis of Operations"; text recognition is then performed on that section to obtain its subsection "I. Overview", which is taken as the target section.
Exemplarily, the target section contains a target table and a target text segment, where the target text segment describes the main product of the enterprise, and the target table describes the main product and the ratio of the turnover of the main product to the total turnover of the enterprise, that is, the proportion of the main product.
It should be noted that an enterprise may have one or more main products. This application takes one main product as an example; the case of multiple main products is similar and is not described again.
302: Perform entity recognition on both the target text segment and the target table to obtain the main product and the proportion of the main product, where the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise.
Exemplarily, entity recognition is performed on the target text segment to obtain product-related entities, and the product corresponding to such an entity is taken as the main product of the enterprise.
For example, if the target text segment states that the main product of the enterprise is "new energy batteries", entity recognition yields "new energy batteries" as the main product of the enterprise.
Further, entity recognition is performed on the target table to determine the position of "new energy batteries" in the target table, and, based on that position, the ratio of the turnover of "new energy batteries" to the total turnover of the enterprise is read from the table.
303: Input the target text segment into the machine reading comprehension model for text segmentation to obtain at least one first text segment, where the at least one first text segment describes at least one sub-product under the main product.
Exemplarily, the machine reading comprehension (MRC) model is pre-trained, and this application does not describe its training process. For the text segmentation in this application, the question of the MRC model is first set to "Which products are sub-products (that is, subdivided products) of the main product?", where the main product is the one obtained above by entity recognition on the target text segment, and the passage input to the MRC model is set to the target text segment. The question is then encoded by the encoding layer of the MRC model to obtain a first vector, and each sub-text segment in the target text segment is encoded to obtain a corresponding second vector. The first vector and the second vector of each sub-text segment are then input into the interaction layer of the MRC model to obtain the similarity between the question and each sub-text segment, and the sub-text segments whose similarity is greater than a preset threshold are taken as the at least one first text segment.
Further, entity recognition is performed on each of the at least one first text segment to obtain the at least one sub-product under the main product.
For example, the target text segment may describe multiple main products and the sub-products under each main product. If the described main products include "new energy batteries" and "wind power generation", then for the main product "new energy batteries", after the target text segment is input into the MRC model, the output first text segments are those describing batteries; for instance, the identified first text segments respectively describe "lithium batteries", "nuclear batteries", and other new energy batteries.
304: Determine the proportion of each sub-product of the main product according to the proportion of the main product.
Exemplarily, the proportion of the main product may be divided evenly among the at least one sub-product according to the number of sub-products, to obtain the proportion of each sub-product.
It should be noted that if a sub-product can itself be split further, it may be split again and its proportion divided among the finer-grained products. This application mainly describes the case in which the main product is split once, without repeated splitting.
For example, if main product A has a proportion of 50% and includes sub-product b and sub-product c, then sub-product b and sub-product c each have a proportion of 25%. Further, if sub-product b includes sub-product d and sub-product e, the proportion of sub-product b can be divided equally, so that sub-product d and sub-product e each have a proportion of 12.5%.
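The even split, including the optional further split of a sub-product, can be sketched as a small recursive function. The nested-dictionary representation of the product hierarchy and the name `split_evenly` are assumptions made for illustration.

```python
from typing import Dict


def split_evenly(proportion: float, children: dict) -> Dict[str, float]:
    """Divide `proportion` evenly among `children` and recurse into any child
    that has sub-products of its own. Returns leaf product -> proportion."""
    if not children:
        raise ValueError("at least one sub-product is required")
    share = proportion / len(children)
    leaves: Dict[str, float] = {}
    for name, grandchildren in children.items():
        if grandchildren:
            leaves.update(split_evenly(share, grandchildren))
        else:
            leaves[name] = share
    return leaves


# Main product A accounts for 50%; b is split further into d and e.
tree = {"b": {"d": {}, "e": {}}, "c": {}}
leaf_shares = split_evenly(0.50, tree)
# leaf_shares == {"d": 0.125, "e": 0.125, "c": 0.25}
```

This reproduces the worked example: c keeps 25%, while the 25% of b is passed down as 12.5% each to d and e.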
305: Determine, according to the similarity model, the similarity between each first text segment and each of multiple second text segments, where the products described by the multiple second text segments are products having a green attribute.
Exemplarily, a first preset document is obtained; for example, the first preset document may be the "Explanation of the Green Industry Guidance Catalog" (《绿色产业指导目录的解释说明》), and all products recorded in it have a green attribute. Entity recognition is performed on the first preset document to obtain the industries (that is, products) recorded in it, and the products so read are taken as the products having a green attribute.
In one embodiment of the present application, the first preset document may not record a product having a green attribute directly, but may instead record it by referring to another document. Therefore, text recognition is first performed on the first preset document to obtain multiple third text segments, where the multiple third text segments describe the products recorded in the first preset document; a given third text segment, however, may describe its product not directly but by citing another document. If any one of the multiple third text segments cites another document, text recognition is performed on that other document to obtain a fourth text segment corresponding to the third text segment, where the fourth text segment is the text in the other document that describes the product having a green attribute, and entity recognition is performed on the fourth text segment to obtain the product it describes. The multiple third text segments and the cited fourth text segments can then be taken as the multiple second text segments, and the products described by the third text segments and by the fourth text segments are all taken as the products having a green attribute.
Exemplarily, the similarity model is obtained by training on multiple pairs of target training samples constructed in advance. The construction of the target training samples and the training process are described in detail later and are not elaborated here. In one embodiment of the present application, the similarity model may be a RoFormer model.
Accordingly, each first text segment and each second text segment are input into the RoFormer model to obtain the similarity between each first text segment and each second text segment.
306: Determine a target first text segment among the at least one first text segment according to the similarities between each first text segment and the multiple second text segments.
Exemplarily, the maximum similarity corresponding to each first text segment is determined from its similarities to the multiple second text segments. If the maximum similarity is greater than a similarity threshold, that first text segment is taken as a target first text segment; in other words, the sub-product described by the target first text segment is determined to be the product having a green attribute described by the second text segment that corresponds to the maximum similarity.
307: Determine the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset.
Exemplarily, the proportion of the sub-product described by the target first text segment is taken as the proportion of green assets in the first digital asset. It should be noted that there may be one or more target first text segments, that is, one or more of the at least one sub-product may have a green attribute.
Exemplarily, when there are multiple target first text segments, the proportions of the sub-products they describe are summed, and the sum is taken as the proportion of green assets in the first digital asset.
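Step 307 for a stock then reduces to summing the proportions of the sub-products that were matched to a green product. A minimal sketch, with a hypothetical function name and hypothetical inputs:

```python
from typing import Iterable, Mapping


def stock_green_proportion(
    sub_product_shares: Mapping[str, float],
    green_sub_products: Iterable[str],
) -> float:
    """Sum the proportions of the sub-products described by target first text
    segments; each sub-product is counted once even if listed repeatedly."""
    return sum(sub_product_shares[name] for name in set(green_sub_products))


# Hypothetical sub-product proportions, with d and c matched as green.
shares = {"d": 0.125, "e": 0.125, "c": 0.25}
green_share = stock_green_proportion(shares, ["d", "c"])
```

With these inputs the stock's proportion of green assets is 0.375; a single matched sub-product would simply return that sub-product's own proportion.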
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a similarity model training method provided by an embodiment of the present application. Content of this embodiment that is the same as in the embodiment shown in FIG. 3 is not described again here. The method of this embodiment includes the following steps:
401: Obtain a second preset document, where the products recorded in the second preset document include products having a green attribute and products having a non-green attribute.
Exemplarily, the second preset document is obtained through web crawling; for example, the second preset document may be the "2017 National Economic Industry Classification Catalog, 2021 Revised First Edition" (《2017国民经济行业分类目录2021修订第一版》). The second preset document records all products currently on the market, so the products recorded in it include both products having a green attribute and products having a non-green attribute.
402: Perform text recognition on the second preset document to obtain multiple fifth text segments, where the multiple fifth text segments describe the products recorded in the second preset document.
Exemplarily, entity recognition is performed on the second preset document to obtain each product recorded in it; the text segments describing each product are then extracted from the second preset document through text recognition to obtain the multiple fifth text segments.
403: Construct multiple pairs of target training samples according to the multiple fifth text segments and the multiple second text segments.
Exemplarily, synonym replacement is performed on the entities in each of the multiple second text segments to obtain a sixth text segment corresponding to each second text segment; each second text segment and its corresponding sixth text segment are then taken as a pair of training samples, to obtain multiple pairs of first training samples. In this application, the multiple pairs of first training samples may also be referred to as multiple pairs of similar samples.
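The construction of similar pairs can be illustrated as follows. This is a simplified sketch: the entity recognition step is replaced by a lookup in a hypothetical synonym dictionary, and the function name, the dictionary, and the sample texts are all assumptions.

```python
from typing import Dict, List, Tuple


def build_similar_pairs(
    second_segments: List[str],
    synonyms: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair each second text segment with a copy in which known entities are
    replaced by synonyms (the sixth text segment). Segments that contain no
    known entity are skipped, since no distinct sixth segment can be built."""
    pairs = []
    for segment in second_segments:
        rewritten = segment
        for entity, synonym in synonyms.items():
            rewritten = rewritten.replace(entity, synonym)
        if rewritten != segment:
            pairs.append((segment, rewritten))
    return pairs


similar_pairs = build_similar_pairs(
    [
        "Manufacture of energy-saving industrial boilers",
        "Construction of wind power facilities",
    ],
    {"energy-saving": "energy-efficient", "wind power": "wind energy"},
)
```

Each resulting pair would be labeled as similar during training, so that differently worded descriptions of the same green industry end up close together.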
应说明,构造了多对第一训练样本之后在训练的过程中,使一对第一训练样本中的两个训练样本之间的距离比较近,这样构造出多对第一训练样本之后,可以让模型能够识别出一些从文字表面上看似不同的产业,其实是相同的绿色产业,从而可以精确的识别出多元化文字表达的绿色产业。It should be noted that after constructing multiple pairs of first training samples, during the training process, the distance between the two training samples in a pair of first training samples is relatively close, so that after constructing multiple pairs of first training samples, you can Let the model identify some seemingly different industries on the surface, but they are actually the same green industry, so that green industries with diversified text expressions can be accurately identified.
示例性的,将多个第五文本段中的多个目标第五文本段剔除,得到多个第七文本段,其中,多个目标第五文本段描述的产品与多个第二文本段描述产品相同,且多个目标第五文本段与多个第二文本段一一对应。Exemplarily, a plurality of target fifth text segments among the plurality of fifth text segments are eliminated to obtain a plurality of seventh text segments, wherein the products described in the plurality of target fifth text segments are different from those described in the plurality of second text segments The products are the same, and the multiple target fifth text segments are in one-to-one correspondence with the multiple second text segments.
具体的,将多个第五文本段与多个第二文本段做差集,得到该多个第七文本段。其中,本申请所指的差集本质上是将文本段描述的产业做差集,即从多个第五文本段中剔除目标第五文本段,得到该多个第七文本段。Specifically, the multiple fifth text segments are subtracted from the multiple second text segments to obtain the multiple seventh text segments. Wherein, the difference set referred to in this application is essentially the difference set of the industry described by the text paragraphs, that is, the target fifth text paragraphs are removed from multiple fifth text paragraphs to obtain the multiple seventh text paragraphs.
It should be understood that the products described by the multiple seventh text segments obtained from this set difference are all products with non-green attributes.
Further, the second text segment corresponding to each seventh text segment is determined, where the product described by the seventh text segment is the same as the product described by the second text segment, but the product described by the seventh text segment has a non-green attribute while the product described by the second text segment has a green attribute. For example, the product described by the second text segment is "energy-saving industrial boiler", and the product described by the seventh text segment is "industrial boiler". Both text segments describe a boiler, but "energy-saving industrial boiler" has a green attribute, whereas "industrial boiler" has a non-green attribute. Each seventh text segment and its corresponding second text segment are therefore used as a pair of training samples, yielding multiple pairs of second training samples. In this application, the multiple pairs of second training samples may be referred to as multiple pairs of dissimilar samples.
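A minimal sketch of this construction is given below. It assumes that every text segment arrives with a product key already extracted (for example "boiler"), and that a fifth text segment counts as a target segment when its text equals that of a green segment; the function name and both assumptions are hypothetical and are not stated in the application.

```python
def build_dissimilar_pairs(fifth_segments, second_segments):
    """Pair each green segment with a non-green segment about the same product.

    Both arguments are lists of (text, product) tuples, where `product`
    is a hypothetical, pre-extracted product key.
    """
    green_texts = {text for text, _ in second_segments}
    # Set difference: drop the fifth segments that are themselves green
    # descriptions, leaving the non-green "seventh" segments.
    seventh_segments = [(t, p) for t, p in fifth_segments if t not in green_texts]
    green_by_product = {product: text for text, product in second_segments}
    return [
        (green_by_product[product], text)
        for text, product in seventh_segments
        if product in green_by_product
    ]


fifth = [("energy-saving industrial boiler", "boiler"), ("industrial boiler", "boiler")]
second = [("energy-saving industrial boiler", "boiler")]
dissimilar = build_dissimilar_pairs(fifth, second)
```

With the boiler example from the text, the single resulting pair is ("energy-saving industrial boiler", "industrial boiler").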
It should be noted that dissimilar samples are constructed because the model needs to recognize that product names which look very similar may in fact denote products with different attributes, and to learn which keywords in such similar names are genuinely related to the green attribute. For the "energy-saving industrial boiler" and "industrial boiler" example above, training lets the model learn that only a boiler qualified as "energy-saving" is a product with a green attribute, so that in expressions of this kind "energy-saving" is identified as the keyword closely tied to the green attribute.
Finally, the multiple pairs of first training samples and the multiple pairs of second training samples are together used as the multiple pairs of target training samples.
404: Train the initial model on the multiple pairs of target training samples to obtain the similarity model.
Exemplarily, each training sample in each pair of target training samples is input into the initial model to obtain a feature vector of that training sample, where the feature vector is used to determine the probability that the product described by the training sample has a green attribute. Then, a first loss corresponding to each training sample is determined from the feature vector and the label of that training sample, where the label indicates whether the product described by the training sample actually has a green attribute. It should be understood that the two training samples in each pair of similar samples have the same label, and the two training samples in each pair of dissimilar samples have different labels.
Specifically, the classifier of the initial model determines, from the feature vector of each training sample, the probability that the product described by that training sample has a green attribute; the first loss corresponding to each training sample is then determined from this probability and the label of the training sample.
Further, a second loss of each pair of target training samples is determined from the feature vectors of the training samples: the similarity between the two training samples in each pair is determined from their feature vectors, and this similarity is used as the second loss of the pair.
Finally, the initial model is trained according to the first loss of each training sample in each pair of target training samples and the second loss corresponding to each pair of target training samples, to obtain the similarity model.
Specifically, a first target loss of the initial model in the green attribute classification process is first determined from the first loss of each training sample in each pair of target training samples. Exemplarily, the first losses of all training samples in the multiple pairs of target training samples are weighted and summed to obtain the first target loss.
Exemplarily, the first target loss can be expressed by formula (7):
[Formula (7): Figure PCTCN2022090224-appb-000011]
where L_1 is the first target loss, avg is the averaging operation, n is the number of pairs of first training samples, m is the number of pairs of second training samples, W is the weight of the classifier of the initial model, f_t′ is the t-th training sample among all training samples in the multiple pairs of target training samples (that is, among the 2(n+m) training samples), and l_t is the label of the t-th training sample.
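The averaging of per-sample classification losses can be sketched as below. The text does not state which per-sample loss is used, so binary cross-entropy is assumed here purely for illustration; the function name and the clipping constant `eps` are likewise assumptions.

```python
import math


def first_target_loss(probabilities, labels, eps=1e-12):
    """Average a per-sample loss over all 2(n+m) training samples.

    `probabilities[t]` is the classifier's predicted probability that
    sample t is green; `labels[t]` is 1 for green and 0 for non-green.
    Binary cross-entropy is an assumed choice of per-sample loss.
    """
    losses = []
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)
```

A confident correct prediction contributes a small loss, and a confident wrong prediction a large one, so minimizing the average pushes the classifier toward the labels.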
Specifically, the loss of the initial model in the feature extraction process for the pairs of first training samples is determined from the second loss of each pair of target training samples, giving a second target loss. Exemplarily, the second loss of each pair of first training samples is obtained, and the second losses of the multiple pairs of first training samples are averaged to obtain the second target loss. Exemplarily, the second target loss can be expressed by formula (8):
[Formula (8): Figure PCTCN2022090224-appb-000012]
where L_sim is the second target loss, avg is the averaging operation, n is the number of pairs of first training samples, S_i is the i-th pair among the n pairs of first training samples, [Figure PCTCN2022090224-appb-000013] is the feature vector of one training sample in the i-th pair of first training samples, [Figure PCTCN2022090224-appb-000014] is the feature vector of the other training sample in the i-th pair of first training samples, and ||·||_2 is the operation that computes the similarity (distance) between the vectors.
Specifically, the loss of the initial model in the feature extraction process for the pairs of second training samples is determined from the second loss of each pair of target training samples, giving a third target loss. Exemplarily, the second loss of each pair of second training samples is obtained, and the second losses of the multiple pairs of second training samples are averaged to obtain the third target loss. Exemplarily, the third target loss can be expressed by formula (9):
[Formula (9): Figure PCTCN2022090224-appb-000015]
where L_dissim is the third target loss, avg is the averaging operation, m is the number of pairs of second training samples, S_j is the j-th pair among the m pairs of second training samples, [Figure PCTCN2022090224-appb-000016] is the feature vector of one training sample in the j-th pair of second training samples, [Figure PCTCN2022090224-appb-000017] is the feature vector of the other training sample in the j-th pair of second training samples, and ||·||_2 is the operation that computes the similarity (distance) between the vectors.
Finally, a fourth target loss is determined from the second target loss and the third target loss. Exemplarily, the fourth target loss is expressed by formula (10):
[Formula (10): Figure PCTCN2022090224-appb-000018]
where L_4 is the fourth target loss and k is a preset stability parameter that prevents the fourth target loss L_4 from becoming zero when L_sim is 0, which in turn prevents the model from degenerating.
The loss function of formula (10) is used because the way the training sample pairs are constructed means that the second target loss L_sim must be optimized toward smaller values while the third target loss L_dissim must be optimized toward larger values, so a plain weighted sum cannot unify the two. With the loss function of formula (10), optimizing only toward a smaller fourth target loss L_4 satisfies the optimization requirements of both the second target loss L_sim and the third target loss L_dissim, and thus of the whole backpropagation process.
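The idea of folding the two pairwise losses into one quantity to minimize can be sketched as follows. The image of formula (10) is not available in this text, so the ratio used here, (L_sim + k) / (L_dissim + k), is only one form consistent with the description (smaller when similar pairs are close and dissimilar pairs are far apart, and never zero when L_sim is 0); it is an assumption, as are the function names and the use of Euclidean distance as the pairwise loss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pair_distance(pairs):
    """Average pairwise distance (stands in for L_sim or L_dissim)."""
    return sum(euclidean(u, v) for u, v in pairs) / len(pairs)


def fourth_target_loss(similar_pairs, dissimilar_pairs, k=1e-3):
    """Assumed ratio form: decreases as L_sim shrinks and L_dissim grows."""
    l_sim = mean_pair_distance(similar_pairs)
    l_dissim = mean_pair_distance(dissimilar_pairs)
    # k keeps the loss strictly positive when l_sim == 0 and also
    # guards against division by zero when l_dissim == 0.
    return (l_sim + k) / (l_dissim + k)
```

Because the numerator and denominator move in opposite directions under the desired optimization, a single minimization target covers both requirements, which a plain weighted sum would not.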
Finally, the fourth target loss and the first target loss are weighted to obtain the final target loss; the initial model is updated by backpropagation based on the target loss and gradient descent until it converges, giving the similarity model.
In one embodiment of the present application, similar training samples can be constructed by replacing the sentence pattern as well as by synonym replacement. Exemplarily, entity recognition is performed on the multiple second text segments to obtain multiple target entities, where the multiple target entities correspond one-to-one with the multiple second text segments; that is, the target entities describing the multiple first products are extracted from the multiple second text segments. Each second text segment and the target entity extracted from it are then used as a pair of training samples, yielding multiple pairs of similar samples that contain different sentence patterns. For example, for the second text segment "this bond will be used to repay the loan for the earlier hydropower station construction project", the segment and "hydropower station" are used as a pair of similar samples. Such a pair is constructed so that, during learning, the model identifies both "this bond will be used to repay the loan for the earlier hydropower station construction project" and "hydropower station" as green products. Similar samples of this kind make the model insensitive to sentence patterns and attentive only to the words genuinely related to the green attribute, which improves its recognition accuracy.
In one embodiment of the present application, when constructing dissimilar samples, for each second text segment one target entity is randomly selected from the remaining target entities and paired with that second text segment as a pair of dissimilar samples, so that multiple pairs of dissimilar samples can be constructed. The remaining target entities are all of the multiple target entities other than the target entity of that second text segment. For example, the "hydropower station" above is randomly replaced with another target entity, such as "wind power station" or "other project construction", to construct multiple pairs of dissimilar samples. Dissimilar samples of this kind let the model learn that what matters is the entity within the sentence pattern: when the entities differ, the samples must be classified as different products. The model thus treats "this bond will be used to repay the loan for the earlier hydropower station construction project" as differing in attribute from "wind power station" and "other project construction", so that even among such close candidates it matches the most similar industry, hydropower station. In other words, entity matching is performed accurately, which improves the recognition accuracy of the model.
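The two entity-based constructions above (a segment paired with its own entity as a similar sample, and with a randomly chosen other entity as a dissimilar sample) can be sketched together. The function name, the fixed seed, and the input format of (segment, entity) tuples are assumptions for illustration; entity recognition itself is taken as already done.

```python
import random


def build_entity_pairs(segments_with_entities, seed=0):
    """Build similar and dissimilar pairs from (segment, entity) tuples."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    entities = [entity for _, entity in segments_with_entities]
    similar, dissimilar = [], []
    for segment, entity in segments_with_entities:
        # Similar sample: the sentence and the entity extracted from it.
        similar.append((segment, entity))
        # Dissimilar sample: the sentence and a random *other* entity.
        others = [e for e in entities if e != entity]
        if others:
            dissimilar.append((segment, rng.choice(others)))
    return similar, dissimilar


data = [
    ("this bond will be used to repay the loan for the earlier "
     "hydropower station construction project", "hydropower station"),
    ("funds are used for wind power station construction", "wind power station"),
]
similar, dissimilar = build_entity_pairs(data)
```

With only two entities available, each segment's dissimilar partner is necessarily the other segment's entity.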
Referring to FIG. 5, FIG. 5 is a schematic flowchart of a method for identifying the proportion of green assets in a bond according to an embodiment of the present application. Content in this embodiment that is the same as in the embodiments shown in FIG. 1, FIG. 3, and FIG. 4 is not described again here. The method of this embodiment includes the following steps:
501: Input the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain multiple first text segments, where the multiple first text segments describe the multiple fund uses of each first digital asset.
It should be noted that the first digital assets here are a subset of the multiple first digital assets, namely the disclosed bonds among the multiple first digital assets.
Before the proportion of green assets in each first digital asset is determined, it can first be determined whether the first digital asset as a whole has a green attribute. If the first digital asset is determined to have no green attribute, the proportion of green assets in it can be set directly to 0; if the first digital asset is determined to have a green attribute, the proportion of green assets in it is then determined.
The process of determining whether a first digital asset has a green attribute is described in detail below.
Exemplarily, the asset name of each first digital asset, that is, the bond name, is determined from the position data described above. Keyword recognition is then performed on the asset name of each first digital asset to obtain one or more first keywords. Finally, if a first keyword belongs to a preset keyword set, the first digital asset is determined to have a green attribute. The preset keyword set consists of bond-related keywords that carry a green attribute, that is, keywords extracted from the bond names of green bonds; for example, the preset keyword set may include "green bond", "carbon neutral", "energy saving", and so on. In this way, whether each bond has a green attribute, that is, whether it is a green bond, is determined from the bond name.
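The name-based check can be sketched as below. Keyword recognition is simplified here to a case-insensitive substring test, which is an assumption; the function name and the English keyword strings are also illustrative rather than taken from an actual keyword set.

```python
# Illustrative stand-in for the preset keyword set described above.
GREEN_KEYWORDS = {"green bond", "carbon neutral", "energy saving"}


def name_has_green_keyword(bond_name, keywords=GREEN_KEYWORDS):
    """Return True if the bond name contains any preset green keyword.

    Simplification: real keyword recognition is replaced by a
    case-insensitive substring match.
    """
    name = bond_name.lower()
    return any(keyword in name for keyword in keywords)
```

For instance, a name containing "Carbon Neutral" passes the check, while a plain "Super-short-term Financing Bond" does not and would fall through to the other checks.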
Exemplarily, the enterprise to which each first digital asset belongs is determined from the position data described above; that is, the issuer of each bond is identified from the position data. The industry of that enterprise is then determined; for example, the industry to which the enterprise's main business products belong may be taken as the industry of the enterprise. Finally, it is determined whether that industry belongs to a preset industry set, and if so, the first digital asset is determined to have a green attribute, where the preset industry set consists of industries with green attributes. Specifically, a preset document, such as the Green Bond Endorsed Projects Catalogue (《绿色债券支持项目目录》), can be obtained, and entity extraction can be performed on it to obtain one or more green industries, for example public transportation and sewage treatment; these green industries are then combined to form the preset industry set. In this way, whether a bond is a green bond is determined from the industry to which the bond belongs.
For example, if the disclosure data of a first digital asset gives the bond type as "Guangzhou Metro Group Co., Ltd. 2020 Phase II Super-short-term Financing Bond", the issuer of the bond is determined from the disclosure data to be Guangzhou Metro Group Co., Ltd., whose industry is public transportation. Since public transportation is an industry in the preset industry set, the first digital asset is determined to have a green attribute.
Exemplarily, text recognition is performed on the disclosure data of each first digital asset to identify a sixth text segment, where the sixth text segment is the text in the disclosure data that describes the multiple fund uses of the first digital asset. That is, the text segments describing each fund use of the bond are located in the disclosure data and extracted to obtain the sixth text segment. Further, semantic information extraction is performed on the sixth text segment to obtain a third feature vector of the sixth text segment; the probability that the first digital asset has a green attribute is then predicted from the third feature vector; and if the probability is greater than a second threshold, the first digital asset is determined to have a green attribute.
In one embodiment of the present application, the above determination of whether the first digital asset has a green attribute can be implemented by a trained model, which may be a fastText, TextCNN, or BERT model, among others; this application does not limit the choice. Specifically, the text describing the use of funds is extracted from bond samples and used as training samples, and each sample is given a label indicating whether the bond sample has a green attribute. It should be understood that both bond samples with green attributes and bond samples with non-green attributes should be selected, so that the constructed samples include positive and negative samples. A model is then trained on the extracted samples and their labels to obtain a prediction model for predicting whether a bond has a green attribute. Finally, the prediction model extracts semantic information from the sixth text segment to obtain its third feature vector, and processes the third feature vector to predict the probability that the first digital asset has a green attribute.
It should be noted that, in practice, the bond name or the industry to which the bond belongs can be used first to determine whether the bond has a green attribute; only when neither of these two ways gives a determination is the model used to predict whether the bond has a green attribute.
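The order of the three checks can be sketched as a simple cascade. The function name, the parameter names, the substring-based name check, and the default threshold are assumptions for illustration; `predict_probability` stands in for whichever trained prediction model is used and is supplied by the caller.

```python
def is_green_bond(bond_name, issuer_industry, usage_text,
                  green_keywords, green_industries,
                  predict_probability, threshold=0.5):
    """Apply the cheap rule-based checks first, the model last."""
    # 1. Bond name contains a preset green keyword.
    if any(keyword in bond_name for keyword in green_keywords):
        return True
    # 2. Issuer's industry is in the preset green industry set.
    if issuer_industry in green_industries:
        return True
    # 3. Fall back to the (hypothetical) trained prediction model.
    return predict_probability(usage_text) > threshold
```

Taking the Guangzhou Metro example, the industry check already succeeds for "public transportation", so the model is never called for that bond.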
It should be understood that, once a first digital asset has been determined to have a green attribute, the proportion of green assets in that first digital asset can be identified. Exemplarily, a Machine Reading Comprehension (MRC) model is trained in advance, and the disclosure data of each first digital asset is then input into the MRC model for text segmentation to obtain at least one first text segment.
Specifically, the question to be answered by the MRC model is first set to "which text describes the use of funds", and the input passage is the disclosure data of each first digital asset. The encoding layer of the MRC model encodes the question to obtain a first vector, and encodes each text segment in the disclosure data to obtain a second vector corresponding to that text segment. The first vector and the second vector of each text segment are then input into the interaction layer of the MRC model to obtain the similarity between the question and each text segment, and the text segments whose similarity is greater than a preset threshold are used as the at least one first text segment.
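The final selection step, keeping the segments whose similarity to the question exceeds a threshold, can be sketched as follows. The encoding and interaction layers of the MRC model are not reproduced: the vectors are assumed to be given, and cosine similarity is used as a stand-in for the interaction layer. The function names and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here in place of the MRC interaction layer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_usage_segments(question_vector, encoded_segments, threshold=0.8):
    """Keep the segments that score above the threshold against the question.

    `encoded_segments` is a list of (text, vector) tuples, assumed to
    have been produced by the model's encoding layer.
    """
    return [
        text
        for text, vector in encoded_segments
        if cosine(question_vector, vector) > threshold
    ]
```

In this simplified view the question vector acts as a query, and the first text segments are the passages of the disclosure data that answer it.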
For example, performing text segmentation on the disclosure data of each first digital asset with the MRC model yields at least one first text segment, as shown in Table 1.
Table 1:
[Figure PCTCN2022090224-appb-000019]
502: Input each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment.
The semantic information extraction model is trained in advance. Its training process is described below.
Exemplarily, training samples are constructed first. For example, text segments related to the use of funds are extracted from the disclosure data of multiple bonds, and each text segment is labeled, where the label indicates whether the fund use described by the text segment actually has a green attribute; the funds may be used for a green industry or a non-green industry. For example, for the fund use shown in Table 1, "for the construction of the Yalong River Kala Hydropower Station project", the industrial project is "construction of the Yalong River Kala Hydropower Station project", so this fund use has a green attribute, that is, it goes to a green industry. The labeled text segments are then used as training samples.
Further, an initial model is constructed. The initial model may be a BERT model that includes a semantic information extraction model and a multilayer perceptron (MLP), where the model parameters of both the semantic information extraction model and the multilayer perceptron are randomly initialized. A training sample is input into the semantic information extraction model for semantic information extraction to obtain a fourth feature vector of the training sample; the fourth feature vector is input into the multilayer perceptron to obtain the probability that the training sample belongs to an industry with a green attribute. Finally, the initial model is trained according to this probability and the label of the training sample, that is, the model parameters of the semantic information extraction model and the multilayer perceptron are adjusted, to obtain a target model. The multilayer perceptron is then removed from the target model to obtain the semantic information extraction model.
Exemplarily, each first text segment can be input into the semantic information extraction model for semantic information extraction to obtain the first feature vector of each first text segment.
In practice, after the target model is obtained, the multilayer perceptron need not be removed and the whole target model can be retained. Each first text segment is then input into the target model for probability prediction to obtain the probability that the fund use described by that first text segment belongs to a green industry; if the probability is greater than a probability threshold, that first text segment is determined to be a target first text segment. The target first text segments are thus determined directly, without any similarity calculation, which improves the efficiency of identifying the proportion of green assets.
503: Input each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment, where the multiple second text segments describe multiple first industries, and the multiple first industries are industries with green attributes.
Exemplarily, multiple industries with green attributes, that is, green industries, are obtained. Specifically, entity recognition (where the entities are industries) is performed on the PDF document of the Green Bond Endorsed Projects Catalogue (《绿色债券支持项目目录》) to obtain multiple industries, which are taken as the multiple first industries, and the multiple second text segments describing the multiple first industries are extracted from the PDF document. Likewise, each second text segment is input into the semantic information extraction model described above for semantic information extraction to obtain the second feature vector of each second text segment.
504: Determine the similarity between each first text segment and each of the multiple second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment.
Exemplarily, the similarity between the first feature vector of each first text segment and the second feature vector of each second text segment can be determined; for example, the similarity can be characterized by the Euclidean distance between the two feature vectors. The similarity between the two feature vectors is used as the similarity between that first text segment and that second text segment.
505: Determine the target first text segments among the multiple first text segments according to the similarity between each first text segment and each second text segment.
Exemplarily, the maximum similarity corresponding to each first text segment is determined from the similarities between that first text segment and each second text segment; if the maximum similarity is greater than a threshold, the first text segment is taken as a target first text segment. Specifically, a maximum similarity greater than the threshold indicates that the industry to which the fund use described by the first text segment belongs is the first industry described by the second text segment with that maximum similarity; that is, the industry supported by the fund use is a green industry, so the fund use can be determined to have a green attribute.
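Steps 504 and 505 can be sketched together as below. The text characterizes similarity by Euclidean distance but does not say how a distance is turned into a similarity score; the mapping 1 / (1 + distance) used here is an assumption, as are the function names and the threshold value.

```python
import math


def similarity(u, v):
    """Map Euclidean distance to a score in (0, 1]; the mapping is assumed."""
    return 1.0 / (1.0 + math.dist(u, v))


def find_target_segments(first_vectors, second_vectors, threshold=0.5):
    """Return the indices of first text segments judged to be green.

    A first text segment is a target when its best match among the
    green-industry vectors exceeds the threshold.
    """
    targets = []
    for index, first in enumerate(first_vectors):
        best = max(similarity(first, second) for second in second_vectors)
        if best > threshold:
            targets.append(index)
    return targets
```

Only the maximum over all green-industry segments matters, so a fund use needs to resemble just one catalogued green industry to be selected.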
506: Take the ratio of the fund amount planned in the fund use described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
Exemplarily, the fund amount planned in the fund use described by the target first text segment is obtained, and the total amount of each first digital asset, that is, the total size of the first digital asset, is obtained; then, the ratio of the planned fund amount to the total amount is taken as the proportion of green assets in each first digital asset.
It should be noted that there may be one or more target first text segments; that is, among the plurality of fund uses of each first digital asset, more than one may be applied to industries with green attributes. In that case, the ratio of the fund amount planned in the fund use described by each target first text segment to the total amount of the first digital asset may be taken as the green ratio corresponding to that target first text segment; the green ratios of all target first text segments are then summed to obtain the proportion of green assets in the first digital asset.
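The arithmetic of step 506 for one first digital asset can be sketched as below. The function name and the example figures are hypothetical.

```python
def green_proportion(planned_amounts, total_amount):
    """Sum the per-segment green ratios of one first digital asset.

    planned_amounts: fund amounts planned in the fund uses described by
                     the target first text segments.
    total_amount:    total amount (total size) of the first digital asset.
    """
    if total_amount <= 0:
        raise ValueError("total_amount must be positive")
    # Each term is the green ratio of one target first text segment.
    return sum(amount / total_amount for amount in planned_amounts)
```

For example, a bond of total size 100 with two green fund uses of 30 and 20 would yield green ratios of 0.3 and 0.2, and a green-asset proportion of about 0.5.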
Referring to FIG. 6, FIG. 6 is a block diagram of the functional units of a device for identifying the proportion of green assets provided by an embodiment of the present application. The device 600 for identifying the proportion of green assets includes an acquisition unit 601 and a processing unit 602.
The acquisition unit 601 is configured to acquire position data of a digital asset to be identified;
The processing unit 602 is configured to perform text recognition on the acquired position data of the digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
The acquisition unit 601 is further configured to acquire disclosure data of each first digital asset according to the asset information of each first digital asset;
The processing unit 602 is further configured to: input the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
determine, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
determine a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
determine the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
acquire, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, obtain the average proportion of green assets among those of the digital assets whose asset information is disclosed, and take the average proportion as the proportion of green assets in the second digital asset; and
determine the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In some possible implementations, when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportion of the sub-products of that enterprise, and the fund distribution described by each second text segment is a product with green attributes. In terms of inputting the disclosure data of each first digital asset into the machine reading comprehension model for text segmentation to obtain at least one first text segment, the processing unit 602 is specifically configured to:
perform text recognition on the annual report to obtain a target chapter in the annual report, wherein the target chapter describes the main product of the enterprise to which each first digital asset belongs, and the target chapter includes a target table and a target text segment;
input the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, each first text segment describing one sub-product of the main product;
In terms of determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset, the processing unit 602 is specifically configured to:
perform entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
determine the proportion of each sub-product of the main product according to the proportion of the main product;
determine the proportion of the sub-product described by the target first text segment according to the proportions of the sub-products; and
determine the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
In some possible implementations, before the similarity between each first text segment and each of the plurality of second text segments is determined according to the similarity model, the acquisition unit 601 is further configured to acquire a first preset document, wherein all products recorded in the first preset document have green attributes;
The processing unit 602 is further configured to: perform text recognition on the first preset document to obtain a plurality of third text segments, wherein the plurality of third text segments describe the products recorded in the first preset document;
if any third text segment among the plurality of third text segments cites another document, perform text recognition on the other document to obtain a fourth text segment corresponding to that third text segment, wherein the fourth text segment is the text in the other document that describes a product with green attributes;
take the plurality of third text segments and the fourth text segment corresponding to that third text segment as the plurality of second text segments;
perform entity extraction on each of the plurality of second text segments to obtain a plurality of target entities;
take any second text segment among the plurality of second text segments, together with the target entity extracted from it, as a pair of training samples, thereby obtaining multiple pairs of first training samples;
randomly select one target entity from the target entities other than the one corresponding to that second text segment, and take the randomly selected target entity and that second text segment as a pair of training samples, thereby obtaining multiple pairs of second training samples;
take the multiple pairs of first training samples and the multiple pairs of second training samples as multiple pairs of target training samples; and
train an initial model with the multiple pairs of target training samples to obtain the similarity model.
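The construction of the training pairs described above can be sketched as follows. The sketch assumes one extracted target entity per second text segment and attaches labels 1 and 0 to the first and second training samples respectively; the labels, the function name, and the fixed random seed are assumptions for illustration and are not stated in the text.

```python
import random


def build_training_pairs(segments, entities, seed=0):
    """Build (segment, entity, label) training samples.

    segments: the second text segments.
    entities: entities[i] is the target entity extracted from segments[i].
    """
    if len(segments) != len(entities):
        raise ValueError("expected one target entity per second text segment")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible

    # First training samples: each segment paired with its own entity.
    first_samples = [(seg, ent, 1) for seg, ent in zip(segments, entities)]

    # Second training samples: each segment paired with a randomly chosen
    # entity that was extracted from a different segment.
    second_samples = []
    for seg, own in zip(segments, entities):
        others = [e for e in entities if e != own]
        if others:
            second_samples.append((seg, rng.choice(others), 0))

    return first_samples + second_samples
```

Pairing each segment with a mismatched entity gives the model contrasting examples, so it can learn to score matching text and entities higher than non-matching ones.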
In some possible implementations, when the asset distribution of each first digital asset is the fund use of that first digital asset, the fund distribution described by each second text segment is a fund use with green attributes. In terms of determining, according to the similarity model, the similarity between each first text segment and each of the plurality of second text segments, the processing unit 602 is specifically configured to:
input each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment;
input each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment;
determine the similarity between each first text segment and each of the plurality of second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment;
In terms of determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset, the processing unit 602 is specifically configured to:
take the ratio of the fund amount planned in the fund use described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
In some possible implementations, in terms of determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset, the processing unit 602 is specifically configured to:
obtain a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified;
determine, according to the first ratio and the proportion of green assets of each first digital asset, a first proportion of the green assets of each first digital asset relative to the net value of the digital asset to be identified;
determine, according to the position data and the first ratio of each first digital asset, a second ratio of the net value of the second digital asset to the net value of the digital asset to be identified;
determine, according to the second ratio and the proportion of green assets of the second digital asset, a second proportion of the green assets of the second digital asset relative to the net value of the digital asset to be identified; and
sum the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
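The aggregation above amounts to a net-value-weighted sum, sketched below. The text does not say how the second ratio is derived from the position data; the sketch assumes the undisclosed second digital asset takes the net-value share left over by the first digital assets. That rule and all function names are assumptions for illustration.

```python
def residual_ratio(first_ratios):
    # Assumed rule: the second digital asset accounts for the share of
    # net value not covered by the first digital assets.
    return max(0.0, 1.0 - sum(first_ratios))


def portfolio_green_proportion(first_assets, second_green):
    """Proportion of green assets in the digital asset to be identified.

    first_assets: list of (first_ratio, green_proportion) per first digital asset.
    second_green: proportion of green assets assumed for the second digital
                  asset (e.g. the manager-level average).
    """
    first_ratios = [ratio for ratio, _ in first_assets]
    # First proportion of each first digital asset: ratio times green proportion.
    first_proportions = [ratio * green for ratio, green in first_assets]
    # Second proportion of the second digital asset.
    second_proportion = residual_ratio(first_ratios) * second_green
    return sum(first_proportions) + second_proportion
```

With two disclosed holdings at 50% and 30% of net value, green proportions of 0.4 and 0.0, and a manager-level average of 0.25 for the remaining 20%, the result is approximately 0.2 + 0.0 + 0.05 = 0.25.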
In some possible implementations, before the first proportion of each first digital asset and the second proportion of the second digital asset are summed, the processing unit 602 is further configured to: perform text recognition on the position data to obtain the total amount of some of the plurality of first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified;
perform text recognition on the position data to obtain the total net value of said some first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified;
determine a third ratio of the sum of the total amount of said some first digital assets and the total amount of the second digital asset to the total amount of the digital asset to be identified;
determine a fourth ratio of the sum of the total net value of said some first digital assets and the total net value of the second digital asset to the total net value of the digital asset to be identified;
determine a leverage ratio according to the third ratio and the fourth ratio;
deleverage, according to the leverage ratio, the first proportion of said some first digital assets and the second proportion of the second digital asset respectively, to obtain a first target proportion of said some first digital assets and a second target proportion of the second digital asset;
In terms of summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified, the processing unit 602 is specifically configured to:
sum the first proportion of the remaining first digital assets among the plurality of first digital assets, the first target proportion of said some first digital assets, and the second target proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 7, the electronic device 700 includes a transceiver 701, a processor 702, and a memory 703, which are connected through a bus 704. The memory 703 is configured to store computer programs and data, and can transmit the stored data to the processor 702.
The processor 702 is configured to read the computer program in the memory 703 to perform the following operations:
controlling the transceiver 701 to acquire position data of a digital asset to be identified;
performing text recognition on the acquired position data of the digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
controlling the transceiver 701 to acquire disclosure data of each first digital asset according to the asset information of each first digital asset;
inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
acquiring, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, obtaining the average proportion of green assets among those of the digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset; and
determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
In some possible implementations, when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportion of the sub-products of that enterprise, and the fund distribution described by each second text segment is a product with green attributes. In terms of inputting the disclosure data of each first digital asset into the machine reading comprehension model for text segmentation to obtain at least one first text segment, the processor 702 is specifically configured to perform the following operations:
performing text recognition on the annual report to obtain a target chapter in the annual report, wherein the target chapter describes the main product of the enterprise to which each first digital asset belongs, and the target chapter includes a target table and a target text segment;
inputting the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, each first text segment describing one sub-product of the main product;
In terms of determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset, the processor 702 is specifically configured to perform the following operations:
performing entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
determining the proportion of each sub-product of the main product according to the proportion of the main product;
determining the proportion of the sub-product described by the target first text segment according to the proportions of the sub-products; and
determining the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
In some possible implementations, before the similarity between each first text segment and each of the plurality of second text segments is determined according to the similarity model, the processor 702 is further configured to perform the following operations:
controlling the transceiver 701 to acquire a first preset document, wherein all products recorded in the first preset document have green attributes;
performing text recognition on the first preset document to obtain a plurality of third text segments, wherein the plurality of third text segments describe the products recorded in the first preset document;
if any third text segment among the plurality of third text segments cites another document, performing text recognition on the other document to obtain a fourth text segment corresponding to that third text segment, wherein the fourth text segment is the text in the other document that describes a product with green attributes;
taking the plurality of third text segments and the fourth text segment corresponding to that third text segment as the plurality of second text segments;
performing entity extraction on each of the plurality of second text segments to obtain a plurality of target entities;
taking any second text segment among the plurality of second text segments, together with the target entity extracted from it, as a pair of training samples, thereby obtaining multiple pairs of first training samples;
randomly selecting one target entity from the target entities other than the one corresponding to that second text segment, and taking the randomly selected target entity and that second text segment as a pair of training samples, thereby obtaining multiple pairs of second training samples;
taking the multiple pairs of first training samples and the multiple pairs of second training samples as multiple pairs of target training samples; and
training an initial model with the multiple pairs of target training samples to obtain the similarity model.
In some possible implementations, when the asset distribution of each first digital asset is the fund use of that first digital asset, the fund distribution described by each second text segment is a fund use with green attributes. In terms of determining, according to the similarity model, the similarity between each first text segment and each of the plurality of second text segments, the processor 702 is specifically configured to perform the following operations:
inputting each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment;
inputting each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment;
determining the similarity between each first text segment and each of the plurality of second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment;
In terms of determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset, the processor 702 is specifically configured to perform the following operations:
taking the ratio of the fund amount planned in the fund use described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
In some possible implementations, in terms of determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset, the processor 702 is specifically configured to perform the following operations:
obtaining a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified;
determining, according to the first ratio and the proportion of green assets of each first digital asset, a first proportion of the green assets of each first digital asset relative to the net value of the digital asset to be identified;
determining, according to the position data and the first ratio of each first digital asset, a second ratio of the net value of the second digital asset to the net value of the digital asset to be identified;
determining, according to the second ratio and the proportion of green assets of the second digital asset, a second proportion of the green assets of the second digital asset relative to the net value of the digital asset to be identified; and
summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
In some possible implementations, before summing the first proportion of each first digital asset and the second proportion of the second digital asset, the processor 702 is further configured to perform the following operations:
performing text recognition on the position data to obtain the total amount of a part of the first digital assets among the plurality of first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified;
performing text recognition on the position data to obtain the total net value of the part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified;
determining a third ratio of the sum of the total amount of the part of the first digital assets and the total amount of the second digital asset to the total amount of the digital asset to be identified;
determining a fourth ratio of the sum of the total net value of the part of the first digital assets and the total net value of the second digital asset to the total net value of the digital asset to be identified;
determining a leverage ratio according to the third ratio and the fourth ratio;
deleveraging, according to the leverage ratio, the first proportion of each of the part of the first digital assets and the second proportion of the second digital asset, to obtain a first target proportion of each of the part of the first digital assets and a second target proportion of the second digital asset;
in terms of summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified, the processor 702 is specifically configured to perform the following operation:
summing the first proportions of the remaining first digital assets among the plurality of first digital assets, the first target proportions of the part of the first digital assets, and the second target proportion of the second digital asset, to obtain the proportion of green assets in the digital asset to be identified.
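The deleveraging operations above can be sketched as follows. The text does not state how the leverage ratio is derived from the third and fourth ratios, nor how a proportion is deleveraged, so this sketch assumes the leverage ratio is the third ratio divided by the fourth ratio and that deleveraging divides a proportion by that ratio. All function and parameter names are hypothetical.

```python
def leverage_ratio(third_ratio: float, fourth_ratio: float) -> float:
    """Assumed definition: amount-based ratio divided by net-value-based ratio."""
    if fourth_ratio <= 0:
        raise ValueError("fourth_ratio must be positive")
    return third_ratio / fourth_ratio


def deleverage(proportion: float, leverage: float) -> float:
    """Assumed rule: scale a green proportion down by the leverage ratio."""
    return proportion / leverage


def green_proportion_after_deleveraging(
    remaining_first_proportions,   # first proportions that are left unchanged
    leveraged_first_proportions,   # first proportions of the leveraged part
    second_proportion,             # second proportion of the second digital asset
    third_ratio,
    fourth_ratio,
):
    """Sum unchanged proportions with the deleveraged target proportions."""
    lev = leverage_ratio(third_ratio, fourth_ratio)
    first_targets = [deleverage(p, lev) for p in leveraged_first_proportions]
    second_target = deleverage(second_proportion, lev)
    return sum(remaining_first_proportions) + sum(first_targets) + second_target
```

With a third ratio of 0.6 and a fourth ratio of 0.5, the assumed leverage ratio is 1.2, so a leveraged first proportion of 0.06 becomes a first target proportion of 0.05 before the final sum.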
Specifically, the transceiver 701 may be the acquisition unit 601 of the green proportion identification apparatus 600 in the embodiment shown in FIG. 6, and the processor 702 may be the processing unit 602 of the green proportion identification apparatus 600 in the embodiment shown in FIG. 6.
It should be understood that the electronic device in the present application may include a smartphone (such as an Android phone, an iOS phone, or a Windows Phone phone), a tablet computer, a palmtop computer, a notebook computer, a mobile Internet device (MID), a wearable device, and the like. These electronic devices are merely examples rather than an exhaustive list; the electronic device includes but is not limited to them. In practical applications, the electronic device may also include a smart vehicle-mounted terminal, a computer device, and so on.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement some or all of the steps of any one of the methods, described in the above method embodiments, for identifying the proportion of green assets in digital assets based on text recognition. The computer-readable storage medium may be non-volatile or volatile.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps of any one of the methods, described in the above method embodiments, for identifying the proportion of green assets in digital assets based on text recognition.
The embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A method for identifying, based on text recognition, the proportion of green assets in digital assets, comprising:
    performing text recognition on acquired position data of a digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
    acquiring disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
    determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
    determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
    determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
    acquiring, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, acquiring the average proportion of green assets among those of all the digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset;
    determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
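The fallback used for the undisclosed second digital asset in claim 1 can be sketched as below. The claim only says "average proportion", so this sketch assumes an unweighted arithmetic mean over the manager's disclosed assets; the function name is hypothetical.

```python
from statistics import mean


def estimate_undisclosed_green_proportion(disclosed_green_proportions):
    """Estimate the green proportion of the second (undisclosed) digital asset.

    disclosed_green_proportions: green proportions of the manager's digital
    assets whose asset information is disclosed. An unweighted mean is an
    assumption; the claim does not specify any weighting.
    """
    if not disclosed_green_proportions:
        raise ValueError("the manager has no disclosed digital assets")
    return mean(disclosed_green_proportions)
```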
  2. The method according to claim 1, wherein,
    when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportions of the sub-products of that enterprise, and the fund distribution described by each second text segment is a product with a green attribute;
    the inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment comprises:
    performing text recognition on the annual report to obtain a target section of the annual report, wherein the target section describes the main product of the enterprise to which each first digital asset belongs, and the target section comprises a target table and a target text segment;
    inputting the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, wherein each first text segment describes one sub-product of the main product;
    the determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset comprises:
    performing entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
    determining the proportion of each sub-product of the main product according to the proportion of the main product;
    determining the proportion of the sub-product described by the target first text segment according to the proportion of each sub-product;
    determining the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
  3. The method according to claim 2, wherein, before the determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, the method further comprises:
    acquiring a first preset document, wherein all products recorded in the first preset document have green attributes;
    performing text recognition on the first preset document to obtain a plurality of third text segments, wherein the plurality of third text segments describe the products recorded in the first preset document;
    if any one of the plurality of third text segments cites another document, performing text recognition on the other document to obtain a fourth text segment corresponding to that third text segment, wherein the fourth text segment is text in the other document that describes a product with a green attribute;
    taking the plurality of third text segments and the fourth text segment corresponding to that third text segment as the plurality of second text segments;
    performing entity extraction on each of the plurality of second text segments to obtain a plurality of target entities;
    taking any one of the plurality of second text segments and the target entity extracted from that second text segment as a pair of training samples, to obtain multiple pairs of first training samples;
    randomly selecting one target entity from the plurality of target entities other than the target entity corresponding to that second text segment, and taking the randomly selected target entity and that second text segment as a pair of training samples, to obtain multiple pairs of second training samples;
    taking the multiple pairs of first training samples and the multiple pairs of second training samples as multiple pairs of target training samples;
    training an initial model according to the multiple pairs of target training samples to obtain the similarity model.
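The construction of the first and second training samples in claim 3 can be sketched as follows. The claim does not state how the pairs are labeled, so the labels 1 (matching pair) and 0 (randomly mismatched pair) are an assumption, as are the function name and the one-entity-per-segment input shape.

```python
import random


def build_training_pairs(segment_entities, seed=0):
    """Build (segment, entity, label) triples from second text segments.

    segment_entities: list of (second_text_segment, target_entity) tuples,
    one extracted entity per segment (an assumption for this sketch).
    """
    rng = random.Random(seed)
    pairs = []
    for i, (segment, entity) in enumerate(segment_entities):
        # First training sample: the segment with its own extracted entity.
        pairs.append((segment, entity, 1))
        # Second training sample: the segment with a random other entity.
        others = [
            e for j, (_, e) in enumerate(segment_entities)
            if j != i and e != entity
        ]
        if others:
            pairs.append((segment, rng.choice(others), 0))
    return pairs
```

The resulting list plays the role of the target training samples; how the initial model consumes them is outside the scope of this sketch.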
  4. The method according to claim 1, wherein,
    when the asset distribution of each first digital asset is the fund use of that first digital asset, the fund distribution described by each second text segment is a fund use with a green attribute;
    the determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments comprises:
    inputting each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment;
    inputting each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment;
    determining the similarity between each first text segment and each of the plurality of second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment;
    the determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset comprises:
    taking the ratio of the fund amount planned in the fund use described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in that first digital asset.
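The steps of claim 4 can be sketched as follows, taking the feature vectors as given. The claim does not name a similarity measure or a selection rule, so cosine similarity and a fixed threshold on the best match are assumptions here, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_target_segments(first_vectors, second_vectors, threshold=0.8):
    """Indices of first text segments whose best match reaches the threshold.

    second_vectors must be non-empty; the threshold value is illustrative.
    """
    return [
        i for i, fv in enumerate(first_vectors)
        if max(cosine_similarity(fv, sv) for sv in second_vectors) >= threshold
    ]


def green_proportion(planned_amounts, target_indices, total_amount):
    """Planned amount of the green fund uses divided by the total amount."""
    return sum(planned_amounts[i] for i in target_indices) / total_amount
```

For example, if only the first of two fund uses matches a green fund use and it plans 30 out of a total of 100, the green proportion is 0.3.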
  5. The method according to claim 4, wherein the determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset comprises:
    acquiring a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified;
    determining, according to the first ratio of each first digital asset and its proportion of green assets, a first proportion of the green assets of each first digital asset relative to the net value of the digital asset to be identified;
    determining, according to the position data and the first ratio of each first digital asset, a second ratio of the net value of the second digital asset to the net value of the digital asset to be identified;
    determining, according to the second ratio of the second digital asset and its proportion of green assets, a second proportion of the green assets of the second digital asset relative to the net value of the digital asset to be identified;
    summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
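The aggregation in claim 5 can be sketched as a net-value-weighted sum. Two points are assumptions: each proportion relative to the whole is taken as the net-value ratio multiplied by the green proportion, and the second ratio is taken as whatever net-value share the first digital assets leave over. The names are hypothetical.

```python
def portfolio_green_proportion(first_assets, second_green_proportion):
    """Combine per-asset green proportions into one portfolio-level figure.

    first_assets: list of (first_ratio, green_proportion) tuples, where
    first_ratio is the asset's net value over the portfolio's net value.
    second_green_proportion: green proportion of the second digital asset.
    """
    # First proportion: green assets of each first asset relative to the whole.
    first_proportions = [ratio * green for ratio, green in first_assets]
    # Second ratio (assumed): the net-value share not covered by first assets.
    second_ratio = 1.0 - sum(ratio for ratio, _ in first_assets)
    second_proportion = second_ratio * second_green_proportion
    return sum(first_proportions) + second_proportion
```

With first assets at (0.5, 0.4) and (0.3, 0.1) and a second green proportion of 0.25, the terms are 0.20, 0.03 and 0.05, giving 0.28.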
  6. The method according to claim 5, wherein, before the summing the first proportion of each first digital asset and the second proportion of the second digital asset, the method further comprises:
    performing text recognition on the position data to obtain the total amount of a part of the first digital assets among the plurality of first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified;
    performing text recognition on the position data to obtain the total net value of the part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified;
    determining a third ratio of the sum of the total amount of the part of the first digital assets and the total amount of the second digital asset to the total amount of the digital asset to be identified;
    determining a fourth ratio of the sum of the total net value of the part of the first digital assets and the total net value of the second digital asset to the total net value of the digital asset to be identified;
    determining a leverage ratio according to the third ratio and the fourth ratio;
    deleveraging, according to the leverage ratio, the first proportion of each of the part of the first digital assets and the second proportion of the second digital asset, to obtain a first target proportion of each of the part of the first digital assets and a second target proportion of the second digital asset;
    the summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified comprises:
    summing the first proportions of the remaining first digital assets among the plurality of first digital assets, the first target proportions of the part of the first digital assets, and the second target proportion of the second digital asset, to obtain the proportion of green assets in the digital asset to be identified.
  7. An apparatus for identifying the proportion of green assets, comprising an acquisition unit and a processing unit, wherein:
    the acquisition unit is configured to acquire position data of a digital asset to be identified;
    the processing unit is configured to perform text recognition on the acquired position data of the digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
    the acquisition unit is further configured to acquire disclosure data of each first digital asset according to the asset information of that first digital asset;
    the processing unit is further configured to input the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
    determine, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
    determine a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
    determine the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
    acquire, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, acquire the average proportion of green assets among those of all the digital assets whose asset information is disclosed, and take the average proportion as the proportion of green assets in the second digital asset;
    determine the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
  8. The apparatus according to claim 7, wherein,
    when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportions of the sub-products of that enterprise, and the fund distribution described by each second text segment is a product with a green attribute;
    in terms of inputting the disclosure data of each first digital asset into the machine reading comprehension model for text segmentation to obtain at least one first text segment, the processing unit is specifically configured to:
    perform text recognition on the annual report to obtain a target section of the annual report, wherein the target section describes the main product of the enterprise to which each first digital asset belongs, and the target section comprises a target table and a target text segment;
    input the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, wherein each first text segment describes one sub-product of the main product;
    in terms of determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset, the processing unit is specifically configured to:
    perform entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
    determine the proportion of each sub-product of the main product according to the proportion of the main product;
    determine the proportion of the sub-product described by the target first text segment according to the proportion of each sub-product;
    determine the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
  9. An electronic device, comprising a processor and a memory, wherein the processor is connected to the memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the following steps:
    performing text recognition on acquired position data of a digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
    acquiring disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
    determining, according to a similarity model, the similarity between each first text segment and each of a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions with green attributes;
    determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
    determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
    acquiring, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, acquiring the average proportion of green assets among those of all the digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset;
    determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
  10. The electronic device according to claim 9, wherein,
    when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportions of the sub-products of that enterprise, and the fund distribution described by each second text segment is a product with a green attribute;
    the inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment comprises:
    performing text recognition on the annual report to obtain a target section of the annual report, wherein the target section describes the main product of the enterprise to which each first digital asset belongs, and the target section comprises a target table and a target text segment;
    inputting the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, wherein each first text segment describes one sub-product of the main product;
    the determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset comprises:
    performing entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
    determining the proportion of each sub-product of the main product according to the proportion of the main product;
    determining the proportion of the sub-product described by the target first text segment according to the proportion of each sub-product;
    determining the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
  11. 根据权利要求10所述的电子设备,其中,根据相似度模型,确定各所述第一文本段分别与多个第二文本段之间的相似度之前,所述步骤还包括:The electronic device according to claim 10, wherein, before determining the similarity between each of the first text segments and a plurality of second text segments according to the similarity model, the steps further include:
    obtaining a first preset document, wherein all products recorded in the first preset document have a green attribute;
    performing text recognition on the first preset document to obtain a plurality of third text segments, wherein the plurality of third text segments describe the products recorded in the first preset document;
    if any third text segment of the plurality of third text segments cites another document, performing text recognition on the other document to obtain a fourth text segment corresponding to that third text segment, wherein the fourth text segment is the text in the other document that describes a product having a green attribute;
    taking the plurality of third text segments and the fourth text segment corresponding to that third text segment as the plurality of second text segments;
    performing entity extraction on each second text segment of the plurality of second text segments to obtain a plurality of target entities;
    taking any second text segment of the plurality of second text segments together with the target entity extracted from that second text segment as a pair of training samples, to obtain multiple pairs of first training samples;
    randomly selecting a target entity from among the plurality of target entities other than the target entity corresponding to that second text segment, and taking the randomly selected target entity together with that second text segment as a pair of training samples, to obtain multiple pairs of second training samples;
    taking the multiple pairs of first training samples and the multiple pairs of second training samples as multiple pairs of target training samples;
    training an initial model on the multiple pairs of target training samples to obtain the similarity model.
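The pairing of text segments with matching and non-matching entities can be sketched as follows. This is only an illustration under stated assumptions: each second text segment is assumed to yield exactly one target entity, and the function name `build_training_pairs` and the 1/0 labels are hypothetical and do not appear in the claim.

```python
import random


def build_training_pairs(segment_entities, seed=0):
    """Build matching and non-matching (segment, entity, label) triples.

    segment_entities: list of (segment_text, entity) tuples, one entity
    per segment (an assumption made for this sketch).
    The labels 1 (matching) and 0 (non-matching) are also an assumption.
    """
    rng = random.Random(seed)
    first_samples = []   # segment paired with its own entity
    second_samples = []  # segment paired with a randomly chosen other entity
    for i, (segment, entity) in enumerate(segment_entities):
        first_samples.append((segment, entity, 1))
        # Candidate entities come from the other segments only.
        others = [e for j, (_, e) in enumerate(segment_entities)
                  if j != i and e != entity]
        if others:
            second_samples.append((segment, rng.choice(others), 0))
    return first_samples + second_samples
```

The combined list corresponds to the target training samples; how the initial model consumes them is left open here.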
  12. The electronic device according to claim 9, wherein:
    when the asset distribution of each first digital asset is the use of funds of that first digital asset, the fund distribution described by each second text segment is a use of funds having a green attribute;
    the determining, according to the similarity model, of the similarity between each first text segment and the plurality of second text segments comprises:
    inputting each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment;
    inputting each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment;
    determining the similarity between each first text segment and the plurality of second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment;
    the determining, according to the asset distribution described by the target first text segment and the total amount of each first digital asset, of the proportion of green assets in each first digital asset comprises:
    taking the ratio of the amount of funds planned in the use of funds described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
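A minimal sketch of the two computations in this claim, assuming the feature vectors are already available as plain lists of floats. The claim does not name a similarity measure; cosine similarity is used here purely as an assumption, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def max_similarity(first_vector, second_vectors):
    """Highest similarity between one first-segment vector and all second-segment vectors."""
    return max(cosine_similarity(first_vector, s) for s in second_vectors)


def fund_use_green_proportion(planned_amount, total_amount):
    """Planned amount of the green use of funds divided by the asset's total amount."""
    if total_amount <= 0:
        raise ValueError("total_amount must be positive")
    return planned_amount / total_amount
```

For example, a first digital asset with a total amount of 120 whose target first text segment plans 30 for a green use of funds would get a green proportion of 30 / 120.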
  13. The electronic device according to claim 12, wherein the determining of the proportion of green assets in the digital asset to be identified, according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset, comprises:
    obtaining a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified;
    determining, according to the first ratio of each first digital asset and its proportion of green assets, a first proportion of the green assets of each first digital asset relative to the net value of the digital asset to be identified;
    determining, according to the position data and the second ratio of each first digital asset, a second ratio of the net value of the second digital asset to the net value of the digital asset to be identified;
    determining, according to the second ratio of the second digital asset and its proportion of green assets, a second proportion of the green assets of the second digital asset relative to the net value of the digital asset to be identified;
    summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
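The aggregation in this claim amounts to a net-value-weighted sum. The sketch below assumes that each "proportion relative to net value" is the product of a net-value ratio and a green proportion, which is one natural reading of the claim rather than something it states; the function name and argument layout are hypothetical, and the second ratio is passed in directly because the claim derives it from the position data.

```python
def portfolio_green_proportion(first_assets, second_ratio, second_green):
    """Net-value-weighted green proportion of the digital asset to be identified.

    first_assets: list of (first_ratio, green_proportion) per first digital asset.
    second_ratio: net-value ratio of the second (undisclosed) digital asset.
    second_green: green proportion assigned to the second digital asset.
    """
    # First proportion of each first digital asset (assumed to be a product).
    first_proportions = [ratio * green for ratio, green in first_assets]
    # Second proportion of the second digital asset.
    second_proportion = second_ratio * second_green
    return sum(first_proportions) + second_proportion
```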
  14. The electronic device according to claim 13, wherein, before the summing of the first proportion of each first digital asset and the second proportion of the second digital asset, the steps further comprise:
    performing text recognition on the position data to obtain the total amount of a part of the first digital assets among the plurality of first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified;
    performing text recognition on the position data to obtain the total net value of the part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified;
    determining a third ratio of the sum of the total amount of the part of the first digital assets and the total amount of the second digital asset to the total amount of the digital asset to be identified;
    determining a fourth ratio of the sum of the total net value of the part of the first digital assets and the total net value of the second digital asset to the total net value of the digital asset to be identified;
    determining a leverage ratio according to the third ratio and the fourth ratio;
    deleveraging, according to the leverage ratio, the first proportion of the part of the first digital assets and the second proportion of the second digital asset respectively, to obtain a first target proportion of the part of the first digital assets and a second target proportion of the second digital asset;
    the summing of the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified comprises:
    summing the first proportion of the remaining part of the first digital assets among the plurality of first digital assets, the first target proportion of the part of the first digital assets, and the second target proportion of the second digital asset, to obtain the proportion of green assets in the digital asset to be identified.
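The third and fourth ratios are defined by the claim, but the claim does not say how they are combined into a leverage ratio or how deleveraging is applied. The sketch below therefore uses two placeholder assumptions, both flagged in the code: the leverage ratio is taken as the fourth ratio divided by the third, and deleveraging divides a proportion by that leverage ratio. All function names are hypothetical.

```python
def ratio_of_sum(part_values, whole_value):
    """Sum of the part values divided by the whole (used for the third and fourth ratios)."""
    if whole_value <= 0:
        raise ValueError("whole_value must be positive")
    return sum(part_values) / whole_value


def leverage_ratio(third_ratio, fourth_ratio):
    """ASSUMPTION: leverage is the quotient fourth / third; the claim gives no formula."""
    if third_ratio <= 0:
        raise ValueError("third_ratio must be positive")
    return fourth_ratio / third_ratio


def deleverage(proportion, leverage):
    """ASSUMPTION: deleveraging scales a proportion down by the leverage ratio."""
    if leverage <= 0:
        raise ValueError("leverage must be positive")
    return proportion / leverage
```

With a third ratio of 0.5 and a fourth ratio of 0.75, these assumptions give a leverage ratio of 1.5, so a first proportion of 0.3 would become a first target proportion of about 0.2.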
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes a computer to perform the following steps:
    performing text recognition on acquired position data of a digital asset to be identified to obtain a plurality of first digital assets and a second digital asset, wherein the position data discloses asset information of each first digital asset and does not disclose asset information of the second digital asset;
    obtaining disclosure data of each first digital asset according to the asset information of that first digital asset, and inputting the disclosure data of each first digital asset into a machine reading comprehension model for text segmentation to obtain at least one first text segment, wherein the at least one first text segment describes the asset distribution of each first digital asset;
    determining, according to a similarity model, the similarity between each first text segment and a plurality of second text segments, wherein the plurality of second text segments describe a plurality of fund distributions having a green attribute;
    determining a target first text segment among the at least one first text segment according to the similarity between each first text segment and the plurality of second text segments;
    determining the proportion of green assets in each first digital asset according to the asset distribution described by the target first text segment and the total amount of each first digital asset;
    obtaining, according to a profile of the manager of the digital asset to be identified, all digital assets managed by the manager, obtaining the average proportion of green assets among those of the digital assets whose asset information is disclosed, and taking the average proportion as the proportion of green assets in the second digital asset;
    determining the proportion of green assets in the digital asset to be identified according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset.
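The estimate for the undisclosed second digital asset can be sketched as an average over the manager's disclosed assets. The claim says "average proportion" without specifying a weighting, so the unweighted arithmetic mean used here is an assumption, as are the function name and the dictionary keys.

```python
def average_green_proportion(managed_assets):
    """Mean green proportion over the manager's assets that disclose asset information.

    managed_assets: list of dicts with keys 'disclosed' (bool) and
    'green_proportion' (float, only read when 'disclosed' is True).
    An unweighted mean is an assumption made for this sketch.
    """
    disclosed = [a["green_proportion"] for a in managed_assets if a["disclosed"]]
    if not disclosed:
        raise ValueError("no disclosed assets to average over")
    return sum(disclosed) / len(disclosed)
```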
  16. The computer-readable storage medium according to claim 15, wherein:
    when the disclosure data of each first digital asset is the annual report of the enterprise to which that first digital asset belongs, the asset distribution of each first digital asset is the proportion of the sub-products of the enterprise to which that first digital asset belongs, and the fund distribution described by each second text segment is a product having a green attribute;
    the inputting of the disclosure data of each first digital asset into the machine reading comprehension model for text segmentation to obtain at least one first text segment comprises:
    performing text recognition on the annual report to obtain a target chapter in the annual report, wherein the target chapter describes the main products of the enterprise to which each first digital asset belongs, and the target chapter includes a target table and a target text segment;
    inputting the target text segment into the machine reading comprehension model for text segmentation to obtain the at least one first text segment, each first text segment describing one sub-product of the main product;
    the determining, according to the asset distribution described by the target first text segment and the total amount of each first digital asset, of the proportion of green assets in each first digital asset comprises:
    performing entity recognition on both the target text segment and the target table to obtain the proportion of the main product, wherein the proportion of the main product is the ratio of the turnover of the main product to the total turnover of the enterprise;
    determining the proportion of each sub-product of the main product according to the proportion of the main product;
    determining the proportion of the sub-product described by the target first text segment according to the proportion of each sub-product;
    determining the proportion of green assets in each first digital asset according to the proportion of the sub-product described by the target first text segment.
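The main-product proportion is defined by the claim as a turnover ratio; how it is divided among sub-products is not. The sketch below assumes an equal split across the sub-products, which is a placeholder and not taken from the claim. The function names are hypothetical.

```python
def main_product_proportion(main_turnover, total_turnover):
    """Turnover of the main product divided by the enterprise's total turnover."""
    if total_turnover <= 0:
        raise ValueError("total_turnover must be positive")
    return main_turnover / total_turnover


def sub_product_proportions(main_proportion, sub_products):
    """ASSUMPTION: the main product's proportion is split equally among its sub-products."""
    if not sub_products:
        raise ValueError("at least one sub-product is required")
    share = main_proportion / len(sub_products)
    return {name: share for name in sub_products}
```

The proportion of the sub-product matched by the target first text segment is then looked up by name in the returned mapping.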
  17. The computer-readable storage medium according to claim 16, wherein, before the determining, according to the similarity model, of the similarity between each first text segment and the plurality of second text segments, the steps further comprise:
    obtaining a first preset document, wherein all products recorded in the first preset document have a green attribute;
    performing text recognition on the first preset document to obtain a plurality of third text segments, wherein the plurality of third text segments describe the products recorded in the first preset document;
    if any third text segment of the plurality of third text segments cites another document, performing text recognition on the other document to obtain a fourth text segment corresponding to that third text segment, wherein the fourth text segment is the text in the other document that describes a product having a green attribute;
    taking the plurality of third text segments and the fourth text segment corresponding to that third text segment as the plurality of second text segments;
    performing entity extraction on each second text segment of the plurality of second text segments to obtain a plurality of target entities;
    taking any second text segment of the plurality of second text segments together with the target entity extracted from that second text segment as a pair of training samples, to obtain multiple pairs of first training samples;
    randomly selecting a target entity from among the plurality of target entities other than the target entity corresponding to that second text segment, and taking the randomly selected target entity together with that second text segment as a pair of training samples, to obtain multiple pairs of second training samples;
    taking the multiple pairs of first training samples and the multiple pairs of second training samples as multiple pairs of target training samples;
    training an initial model on the multiple pairs of target training samples to obtain the similarity model.
  18. The computer-readable storage medium according to claim 15, wherein:
    when the asset distribution of each first digital asset is the use of funds of that first digital asset, the fund distribution described by each second text segment is a use of funds having a green attribute;
    the determining, according to the similarity model, of the similarity between each first text segment and the plurality of second text segments comprises:
    inputting each first text segment into a semantic information extraction model for semantic information extraction to obtain a first feature vector of each first text segment;
    inputting each second text segment into the semantic information extraction model for semantic information extraction to obtain a second feature vector of each second text segment;
    determining the similarity between each first text segment and the plurality of second text segments according to the first feature vector of each first text segment and the second feature vector of each second text segment;
    the determining, according to the asset distribution described by the target first text segment and the total amount of each first digital asset, of the proportion of green assets in each first digital asset comprises:
    taking the ratio of the amount of funds planned in the use of funds described by the target first text segment to the total amount of each first digital asset as the proportion of green assets in each first digital asset.
  19. The computer-readable storage medium according to claim 18, wherein the determining of the proportion of green assets in the digital asset to be identified, according to the proportion of green assets in each first digital asset and the proportion of green assets in the second digital asset, comprises:
    obtaining a first ratio of the net value of each first digital asset to the net value of the digital asset to be identified;
    determining, according to the first ratio of each first digital asset and its proportion of green assets, a first proportion of the green assets of each first digital asset relative to the net value of the digital asset to be identified;
    determining, according to the position data and the second ratio of each first digital asset, a second ratio of the net value of the second digital asset to the net value of the digital asset to be identified;
    determining, according to the second ratio of the second digital asset and its proportion of green assets, a second proportion of the green assets of the second digital asset relative to the net value of the digital asset to be identified;
    summing the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified.
  20. The computer-readable storage medium according to claim 19, wherein, before the summing of the first proportion of each first digital asset and the second proportion of the second digital asset, the steps further comprise:
    performing text recognition on the position data to obtain the total amount of a part of the first digital assets among the plurality of first digital assets, the total amount of the second digital asset, and the total amount of the digital asset to be identified;
    performing text recognition on the position data to obtain the total net value of the part of the first digital assets, the total net value of the second digital asset, and the total net value of the digital asset to be identified;
    determining a third ratio of the sum of the total amount of the part of the first digital assets and the total amount of the second digital asset to the total amount of the digital asset to be identified;
    determining a fourth ratio of the sum of the total net value of the part of the first digital assets and the total net value of the second digital asset to the total net value of the digital asset to be identified;
    determining a leverage ratio according to the third ratio and the fourth ratio;
    deleveraging, according to the leverage ratio, the first proportion of the part of the first digital assets and the second proportion of the second digital asset respectively, to obtain a first target proportion of the part of the first digital assets and a second target proportion of the second digital asset;
    the summing of the first proportion of each first digital asset and the second proportion of the second digital asset to obtain the proportion of green assets in the digital asset to be identified comprises:
    summing the first proportion of the remaining part of the first digital assets among the plurality of first digital assets, the first target proportion of the part of the first digital assets, and the second target proportion of the second digital asset, to obtain the proportion of green assets in the digital asset to be identified.
PCT/CN2022/090224 2021-10-30 2022-04-29 Method for recognizing proportion of green assets in digital assets and related product WO2023071120A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111280770.2 2021-10-30
CN202111280770.2A CN113902569A (en) 2021-10-30 2021-10-30 Method for identifying the proportion of green assets in digital assets and related products

Publications (1)

Publication Number Publication Date
WO2023071120A1 (en) 2023-05-04

Family

ID=79027228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090224 WO2023071120A1 (en) 2021-10-30 2022-04-29 Method for recognizing proportion of green assets in digital assets and related product

Country Status (2)

Country Link
CN (1) CN113902569A (en)
WO (1) WO2023071120A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902568A (en) * 2021-10-30 2022-01-07 平安科技(深圳)有限公司 Method for identifying green asset proportion and related product
CN113902569A (en) * 2021-10-30 2022-01-07 平安科技(深圳)有限公司 Method for identifying the proportion of green assets in digital assets and related products

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154440A (en) * 2017-12-21 2018-06-12 平安科技(深圳)有限公司 FoF assets industry analysis method, terminal and computer readable storage medium
CN110991441A (en) * 2019-12-13 2020-04-10 王文斌 Asset assessment method and device based on image recognition and computer storage medium
KR102173813B1 (en) * 2020-02-28 2020-11-04 신한아이타스(주) Method and apparatus for providing real-time service for stock attribution analysis of fund
CN113902569A (en) * 2021-10-30 2022-01-07 平安科技(深圳)有限公司 Method for identifying the proportion of green assets in digital assets and related products

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740560B2 (en) * 2017-06-30 2020-08-11 Elsevier, Inc. Systems and methods for extracting funder information from text
CN110781299B (en) * 2019-09-18 2024-03-19 平安科技(深圳)有限公司 Asset information identification method, device, computer equipment and storage medium
CN112214987B (en) * 2020-09-08 2023-02-03 深圳价值在线信息科技股份有限公司 Information extraction method, extraction device, terminal equipment and readable storage medium
CN112734569A (en) * 2020-12-31 2021-04-30 沈阳麟龙科技股份有限公司 Stock risk prediction method and system based on user portrait and knowledge graph
CN112767132B (en) * 2021-01-26 2024-02-02 北京国腾联信科技有限公司 Data processing method and system
CN113065966A (en) * 2021-05-06 2021-07-02 腾讯科技(深圳)有限公司 Method and device for determining types of business products
CN113505601A (en) * 2021-07-08 2021-10-15 平安科技(深圳)有限公司 Positive and negative sample pair construction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113902569A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN111241837B (en) Theft case legal document named entity identification method based on anti-migration learning
CN110163478B (en) Risk examination method and device for contract clauses
CN109597994B (en) Short text problem semantic matching method and system
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
WO2023071120A1 (en) Method for recognizing proportion of green assets in digital assets and related product
CN111222305A (en) Information structuring method and device
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
Zhao et al. The study on the text classification for financial news based on partial information
WO2023108985A1 (en) Method for recognizing proportion of green asset and related product
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN112800239B (en) Training method of intention recognition model, and intention recognition method and device
WO2021139278A1 (en) Intelligent interview method and apparatus, and terminal device
CN109101489A (en) A kind of text automatic abstracting method, device and a kind of electronic equipment
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
AU2019204988A1 (en) Determination of a response to a query
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN113378090B (en) Internet website similarity analysis method and device and readable storage medium
Haryono et al. Aspect-based sentiment analysis of financial headlines and microblogs using semantic similarity and bidirectional long short-term memory
CN110399477A (en) A kind of literature summary extracting method, equipment and can storage medium
JP2024518458A (en) System and method for automatic topic detection in text
CN109885695A (en) Assets suggest generation method, device, computer equipment and storage medium
WO2023071129A1 (en) Method for identifying proportion of green assets and related product
CN109635289B (en) Entry classification method and audit information extraction method
CN117009516A (en) Converter station fault strategy model training method, pushing method and device
WO2023087935A1 (en) Coreference resolution method, and training method and apparatus for coreference resolution model

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22885034

Country of ref document: EP

Kind code of ref document: A1