WO2015095229A1

WO2015095229A1 - Constructing industrial sector financial indices

Info

Publication number: WO2015095229A1
Application number: PCT/US2014/070663
Authority: WO
Inventors: James P. SETHNA; Ricky CHACHRA; Alexander A. ALEMI; Paul H. GINSPARG
Original assignee: Cornell University
Priority date: 2013-12-16
Filing date: 2014-12-16
Publication date: 2015-06-25

Abstract

Methods, systems, and devices are disclosed for producing canonical industrial sector financial indices and weighted decomposition of stocks. In one aspect, a computer implemented method for classifying a financial asset in a financial market into sectors based on price returns of the stocks is described. The method includes identifying sectors in the financial market. The method includes creating a weighted decomposition of the identified sectors for each financial asset in a group of financial assets by assigning weights denoting an extent to which each financial asset's price return includes price returns of the identified sectors.

Description

CONSTRUCTING INDUSTRIAL SECTOR FINANCIAL INDICES

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This patent document claims priority to and benefits of U.S. Provisional Application No. 61/916,719 entitled "METHODS TO CONSTRUCT INDUSTRIAL SECTOR FINANCIAL INDICES" filed on December 16, 2013. The entire content of the above noted provisional application is incorporated by reference as part of the disclosure of this document.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This invention was made with government support under grants DMR 1312160 and OCI 0926550 awarded by the National Science Foundation (NSF). The government has certain rights in the invention.

TECHNICAL FIELD

[0003] This patent document relates to systems and processes that classify and analyze financial markets.

BACKGROUND

[0004] Industry is the production of an economic good or service within an economy. Industries, the countries they reside in, and the economies of those countries are interlinked in a complex web of interdependences. Industries can be classified in a variety of ways, including categorization of industries into sectors.

[0005] A financial instrument can include a tradeable asset of any kind (e.g., including cash), evidence of an ownership interest in an entity, or a contractual right to receive or deliver cash or another financial instrument. For example, financial instruments can be categorized by form depending on whether they are cash instruments or derivative instruments.

SUMMARY

[0006] Classification of companies into sectors of the economy is important for macroeconomic analysis and for investments into sector-specific financial indices or exchange traded funds (ETFs). Major industrial classification systems and financial indices have so far largely been developed by empirical methods based on expert opinions, with questionable objectivity and questionable completeness. Because of the dependence on expert opinions, industrial classification systems and financial indices have been developed manually. Disclosed are methods, systems and devices for constructing canonical industrial sector financial indices and weighted decomposition of stocks, which can be implemented to provide actionable information for investments in the canonical sectors of the economy through companies listed on the stock markets and as economic indicators to gauge the performance of industrial sectors. Exemplary implementations of the disclosed techniques are described herein, for example, including showing how a broad-level sector decomposition of stocks can be made objectively and comprehensively using a machine learning approach that exploits the emergent low-dimensional structure of the space of historical stock price returns. The described techniques can be implemented to automatically identify emergent "canonical sectors" in the market and assign every stock a participation weight in each sector. Also, for example, by analyzing data from different periods, the exemplary implementations described herein show how firms listed in the market have evolved in their decomposition into sectors.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] FIG. 1 shows an example of a projection of the stock price returns data space.

[0008] FIGS. 2A and 2B show examples of singular vectors V_s of the SVD of returns R_ts.

[0009] FIG. 3 shows low-dimensional projections of stock returns data.

[0010] FIG. 4 shows projections onto eigenplanes of the factorized returns.

[0011] FIG. 5 shows an exemplary diagram depicting canonical sector decomposition of stocks of exemplary selected companies.

[0012] FIG. 6 shows an exemplary diagram depicting emergent sector time series.

[0013] FIG. 7 shows an exemplary canonical sector time series.

[0014] FIG. 8 shows an exemplary diagram of evolving sector participation weights.

[0015] FIG. 9 shows an example of projections onto eigenplanes of the normalized log price returns.

[0016] FIG. 10 shows an example of projections along eigenplanes of the normalized log price returns.

[0017] FIGS. 11 and 12 show exemplary diagrams of weight distribution in canonical sectors.

[0018] FIG. 13 shows an exemplary plot of normalized distribution of singular values.

[0019] FIG. 14 shows an exemplary diagram of canonical sector constituents.

[0020] FIG. 15 shows exemplary Canonical Sector Constituents shown as columns of the

[0021] FIG. 16 shows an exemplary comparison of a 3 Factor Model vs. Fama and French 2D projections of the weights for each company in the SP500 with current tickers and data in the date range considered.

DETAILED DESCRIPTION

[0022] The performance of the economy is often understood in a reductionist way. This entails decomposing the economy into its constituents and then learning how each performed over a given period using the so-called economic indicators. These variables measure unemployment rate, housing starts, consumer price index, gross domestic product, etc., e.g., allowing for broad macroeconomic analysis and modeling.

[0023] In analogy to the broader economy, the performance of financial markets (e.g., stock markets, bond markets, currency markets, commodity markets, oil markets, etc.) is reported similarly in terms of aggregated quantities with groups of financial assets, such as stocks, taken at once. The aggregated quantities are referred to as indices that represent a weighted average price of the groups of financial assets, such as stocks. A finer, micro-level analysis quickly becomes impractical because of the plethora of listed financial assets, such as stocks. For example, stocks from over 8000 US public companies are available for trading in various markets. These include over 4500 securities listed on the three major domestic US exchanges, e.g., NASDAQ, NYSE and NYSE MKT. For convenience of analysis and investment, stocks are grouped into indices such as the market-wide Russell 3000 and S&P 500, comprising stocks from diverse companies to reflect the entirety of the market, and sector-specific indices such as Dow Jones Financials Index, CBOE Oil Index, and Morgan Stanley High-Tech 35 Index that are more granular indicators of performance in individual named sectors.

[0024] In principle, a set of mutually exclusive and collectively exhaustive sector indices could describe the overall stock market as a sum of parts, but for practical applications, this approach is rife with ambiguities. First, to what sector does one assign a conglomerate or diversified company such as General Electric that functions in a variety of businesses across different sectors? Second, how does one account for the participation of non-conglomerates outside their core sectors? For example, if a financial services company is deeply invested in the pharmaceutical sector to an extent that such a causal relationship is manifest in strongly correlated returns of the two, should that company be considered part of a financial services index or a healthcare one? Third, as economic environments or companies evolve, neither the industrial sectors nor firms' sector associations remain static, so how does one account for the dynamic nature of firms comprising an index?

[0025] The aforementioned technical issues must be resolved in a manner that is grounded in a theoretical framework built to describe the character and properties of the underlying entities. For example, a vast number of studies have previously aimed at finding structure and categories of stocks in financial markets with a variety of approaches. Recent numerical techniques have included extensive use of random matrix theory, principal component analysis or the associated eigenvalue decomposition of the correlation matrix, specialized clustering methods or time series analysis, pairwise coupling analysis, and even topic modeling of returns. Analysis of historical stock price returns elucidated that the high-dimensional space of stock price returns has a low-dimensional representation. This implies that only a few dimensions in the space of price returns have signal and the rest can be ascribed to random noise. While these methods have yielded important results, a fundamental basis of macro-level analysis, resting upon emergent properties in the markets, has so far remained elusive.

[0026] Techniques, systems, and devices are described for constructing canonical industrial sector financial indices and weighted decomposition of stocks, which can be implemented to provide actionable information for investments in the canonical sectors of the economy through companies listed on the stock markets and as economic indicators to gauge the performance of industrial sectors.

[0027] The canonical sector decomposition is a factorization carefully chosen to produce a meaningful window into the underlying structure of the system. The canonical examples of sectors discerned by this decomposition are created in an unsupervised, algorithmic way from the data, allowing for stocks to be described as a weighted convex linear combination of these 'canonical sectors'. In this case, the time series of stock data is centered and normalized. While this approach may seem perverse in the sense that the volatility and growth are typically treated as key pieces of information, this choice enabled the enumeration of the described canonical sectors via archetypal analysis.

[0028] Exemplary implementations of the disclosed techniques are described which demonstrate a new, holistic way of classifying stocks into industrial sectors by utilizing the emergent structure of price returns data space. The exemplary method identifies sectors in the market and assigns each stock weights denoting the extent to which each stock's price return is composed of emergent sector returns. Relying purely upon an unsupervised machine learning analysis of historical time series of stock price returns, this method is an objective way of understanding stocks solely through their returns. In one aspect, taking the log price returns of individual stocks, removing the overall market return, and normalizing to zero mean and unit standard deviation (s.d.) provides stock returns that are well approximated by a hyper-tetrahedral (simplex) structure. For example, in the subsequent sections, exemplary results of the described exemplary implementations show that the space of stock price returns has a hyper-tetrahedral (simplex) structure, with each lobe of the hyper-tetrahedron populated by stocks of similar or related businesses, as shown in FIG. 1. FIG. 1 shows a low-dimensional projection 100 of the stock price returns data. Stock price returns are projected onto a plane spanned by two stiff vectors from the SVD of the emergent simplex corners as described in this document. Each colored circle corresponds to one of the 705 stocks in the dataset used in the analysis. Colors denote the sectors assigned to companies by Scottrade, and the scheme is shown in FIG. 2A. FIGS. 2A and 2B show examples of singular vectors V_s of the SVD of returns R_ts, 200 and 250 respectively. The orthonormal right singular vectors (rows of V_s) of the SVD of R_ts are equivalent to the eigenvectors of the stock-stock correlation matrix, which is proportional to R^T R. Eight of the stiffest eigenvectors, including the market mode, are shown in rows of two at a time.
Each has 705 components corresponding to stocks in the dataset. The market mode, with all components in the same direction, describes overall fluctuations in the market; it was excluded from the analysis described in the paper. It was previously suggested that each eigenvector of the stock-stock correlation matrix describes a listed sector; however, as seen above, a more correct interpretation is that each eigenvector is a mixture of listed sectors with opposite signs in components. For example, the stiffest direction (after the market mode) has positive components in real estate and utility, but negative components in tech. Less stiff eigenvectors (including the last one shown here) do not contain sector-relevant information. Stocks are colored by listed sectors as shown at the bottom. Listed sector information includes Basic 202, Capital 204, Cyclical 206, Energy 208, Financial 210, Health 212, Non-cyclical 214, Tech 216, Telecom 218, Services 222, Real estate 224, Retail 226, Transport 228. The Y-axis range is from -0.5 to 0.3. In FIG. 1, the grey corners 102, 104, 106, 108, 110, 112, 114 and 116 of the simplex correspond to sector-defining prototype stocks, whereas all other circles are given by a suitably weighted sum of these grey corners.
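The stated equivalence between the right singular vectors of the returns matrix and the eigenvectors of the stock-stock correlation matrix can be checked numerically. The following is a minimal sketch on synthetic data, not the dataset used in this document; the matrix sizes and the random returns are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical sizes: T trading days by S stocks (synthetic data, not the 705-stock dataset).
rng = np.random.default_rng(0)
T, S = 500, 6
R = rng.standard_normal((T, S))
R = (R - R.mean(axis=0)) / R.std(axis=0)  # zero mean and unit s.d. for each stock

# SVD of the returns matrix: rows of Vt are the orthonormal right singular vectors.
U, sv, Vt = np.linalg.svd(R, full_matrices=False)

# Stock-stock correlation matrix, proportional to R^T R.
corr = R.T @ R / T
eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # reorder from stiffest to least stiff
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvector matches the corresponding right singular vector up to a sign,
# and each eigenvalue equals the squared singular value divided by T.
overlaps = np.abs(np.sum(Vt * eigvecs.T, axis=1))
print(np.round(overlaps, 6))
print(np.allclose(eigvals, sv**2 / T))
```

In this sketch the stiffest directions are simply those with the largest singular values; on real returns the first of these would be the market mode described above.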

[0029] Projections along other singular vectors of raw data (before the factorization) are shown in FIG. 3. FIG. 3 shows low-dimensional projections of stock returns data. Each colored circle represents a stock in the dataset, colored according to sectors assigned by Scottrade. As described with respect to FIGS. 2A and 2B above, the listed sector information includes Basic 302, Capital 304, Cyclical 306, Energy 308, Financial 310, Health 312, Non-cyclical 314, Tech 316, Telecom 318, Services 322, Real estate 324, Retail 326, Transport 328. The first row is repeated from FIG. 1. Black circles (e.g., 330, 332, 334, 336, 338) represent the archetypes found with the disclosed analysis. The (i, j)-th figure in the grid is a plane spanned by singular vectors i and j + 1 (rows of X^T R).

[0030] Projections after the factorization are shown in FIG. 4. FIG. 4 shows projections along eigenplanes of the factorized returns 400. Each colored circle represents a stock in the dataset and is colored according to the scheme in FIG. 2A based on the primary sector association found after the calculations described in this document. The listed sector information includes Basic 402, Capital 404, Cyclical 406, Energy 408, Financial 410, Health 412, Non-cyclical 414, Tech 416, Telecom 418, Services 422, Real estate 424, Retail 426, Transport 428. Black circles (e.g., 430, 432, 434, 436, 438) represent the archetypes found with the described analysis. The (i, j)-th figure in the grid is a plane spanned by singular vectors i and j + 1 (rows of MN^T) from the calculations described in this document.

[0031] The lobe corners (canonical sectors) approximate the returns of companies that are prototypical of individual sectors, as shown in Table 1. Table 1 shows canonical sectors and major business lines of primary constituent firms. The eight canonical sectors identified by the analysis described here are listed in the column on the left; these were named in accord with the business lines (middle column) of firms that show strong association with these sectors. The examples provided are firms that are strongly associated with these sectors. A full list is available on the companion website [www.lassp.cornell.edu/sethna/Finance].

[0032] Table 1: Canonical sectors and major business lines of primary constituent firms.

Canonical sector | Business lines | Prototypical examples
c-cyclical | general and speciality retail, discretionary goods | AutoZone, Kohl's, Nordstrom
c-energy | oil and gas services, equipment, operations | Hess, Schlumberger
c-financial | banks, insurance (except health) | Citigroup, Wells Fargo, M&T Bank
c-industrial goods | capital goods, basic materials, transport | Dow Chemical Co., Goodyear
c-non-cyclical | consumer staples, healthcare | Pepsi, Procter & Gamble
c-real estate | realty investments and operations | Vornado Realty, Camden Property Trust
c-technology | semiconductors, computers, communication devices | Intel, Motorola, Oracle
c-utility | electric and gas suppliers | Duke Energy, Edison International

[0033] Each cell of the simplex is populated by stocks with similar returns time series; the corners of the simplex correspond to emergent "canonical" sectors occupied by stocks of companies that are prototypical. Every other stock's return decomposes into a weighted sum (see FIG. 5) of returns from the prototypical stocks (see FIG. 6). FIG. 5 shows the canonical sector decomposition of stocks of selected companies. A complete set of all 705 stocks is provided on the companion website [www.lassp.cornell.edu/sethna/Finance]; the color scheme is shown on the right and includes c-cyclical 502, c-energy 504, c-financial 506, c-industrial 508, c-non-cyclical 510, c-real estate 512, c-technology 514, and c-utility 516. Conglomerates like GE decompose roughly into their core business lines. Tech firms such as Apple that sell mass-market consumer goods have an important fraction in c-cyclical, whereas IBM has a significant portion of c-non-cyclical returns, presumably due to its government contracts. Telecom companies like AT&T are generally classified under a separate telecom category by major classification systems, yet analysis shows their returns are described by a combination of c-non-cyclical and c-utility sectors. Health insurance providers like Aetna are commonly classified as financial services firms, but their returns consist of a major part c-non-cyclical and only a minor part c-financial; the healthcare sector is generally less prone to economic downturns. Defense contractors like Lockheed are listed as capital goods companies, but their returns are seen to be majority c-non-cyclical with only a smaller share of the c-industrial sector.

[0034] FIG. 6 shows an exemplary emergent sector time series for c-cyclical 602, c-energy 604, c-financial 606, c-industrial 608, c-non-cyclical 610, c-real estate 612, c-technology 614, and c-utility 616. Annualized cumulative log price returns of the eight emergent sectors are shown. The time series capture all important features affecting different sectors: the build-up of the dot-com bubble (c. 2000) followed by a burst, the soaring energy valuations (2003-08) followed by a crash, and the financial crisis of 2008. We note that the dot-com bubble was confined to the c-tech sector, whereas the effects of the financial crisis were spread throughout the sectors. A precise definition of the cumulative returns plotted here is given in (Eqn. S2); other measures of sector dynamics are in FIG. 7.

[0035] FIG. 7 shows an exemplary canonical sector time series. The top row shows normalized log returns (columns of E_tf) for c-cyclical 702, c-energy 704, c-financial 706, c-industrial 708, c-non-cyclical 710, c-real estate 712, c-technology 714, and c-utility 716. The middle row shows cumulative log returns (same as FIG. 6 and defined in Eqn. S2) for c-cyclical 718, c-energy 720, c-financial 722, c-industrial 724, c-non-cyclical 726, c-real estate 728, c-technology 730, and c-utility 732. The bottom row shows the unweighted price index of canonical sectors (Eqn. S4) for c-cyclical 734, c-energy 736, c-financial 738, c-industrial 740, c-non-cyclical 742, c-real estate 744, c-technology 746, and c-utility 748.

[0036] The participation weights of the companies are dynamic and provide insights into their evolving nature, as shown in FIG. 8. FIG. 8 shows evolving sector participation weights. Results from the sector decomposition made with rolling two-year Gaussian windows are shown for selected stocks. A complete set of 705 charts is provided on the companion website [www.lassp.cornell.edu/sethna/Finance]. The color scheme is as in FIG. 5 and includes c-cyclical 502, c-energy 504, c-financial 506, c-industrial 508, c-non-cyclical 510, c-real estate 512, c-technology 514, and c-utility 516 canonical sectors. For stable and focused companies such as Pacific Gas & Electric or IBM, one sees no significant shifts in sector weights. Wal-Mart's returns, on the other hand, have moved significantly from c-cyclical to c-non-cyclical (consumer staples) in the years after the financial crisis, as shown; this is also true of other low-price consumer commodities retailers such as Costco, but not true of higher-price retailers such as Whole Foods, Macy's, etc. Corning, previously an industrial firm with a huge presence in optical fiber, suffered in the aftermath of the dot-com crisis and now is classified as a tech firm, presumably due to its Gorilla® glass used in cellphones, laptop displays, and tablets. Berry Petroleum grew within its home state of California in the early 1990s through development on properties that were purchased in the earlier part of the 20th century. In 2003, the company embarked on a transformation by direct acquisition of light oil and natural gas production facilities outside California. The figure shows a clear shift in the distribution of sector weights as the company has moved toward c-energy and away from c-real estate. Similarly, as Plum Creek Timber converted to a real estate investment trust (REIT) in the late 1990s, its sector weights have also significantly shifted toward the c-real estate sector, as shown.

[0037] Canonical Sectors and Price Returns

[0038] The high-dimensional space of stock price returns has a low-dimensional representation. This implies that only a few dimensions in the space of price returns have signal, and the rest can be ascribed to random noise. The daily log returns of a stock s are defined as $r_{ts} = \log P_{ts} - \log P_{(t-1)s}$, where $P_{ts}$ are adjusted closing prices (i.e., corrected for stock splits and dividend issues) and t is in trading days. In the present analysis, we used normalized returns, $R'_{ts} = (r_{ts} - \langle r_{ts} \rangle_t)/\sigma_s$, where $\sigma_s^2 = \langle r_{ts}^2 \rangle_t - \langle r_{ts} \rangle_t^2$ is the variance (squared volatility). Overall market returns were also removed from each stock, yielding $R_{ts} = R'_{ts} - \langle R'_{ts} \rangle_s$. A key feature of the present technology is that the low-dimensional representation of R (stock price returns) has a well-defined structure leading to new insights about individual stocks and the industrial sectors of the economy. This structure is an emergent hyper-tetrahedron (also known as a simplex) that becomes apparent upon visualizing low-dimensional projections of the exemplary data, as shown in FIGS. 1, 4 and 9. For FIG. 9, the canonical sectors are the same as in FIG. 4, including Basic 402, Capital 404, Cyclical 406, Energy 408, Financial 410, Health 412, Non-cyclical 414, Tech 416, Telecom 418, Services 422, Real estate 424, Retail 426, Transport 428. Black circles (e.g., 430, 432, 434, 436, 438) represent the archetypes found with the described analysis. The simplex shown is an emergent, self-organized structure: the corners of every cell comprise companies that are prototypical of known sectors (e.g., Texas Instruments, Wells Fargo, Kohl's, etc.). Each cell of the simplex is populated by stocks of companies in similar or related business lines, implying that every cell corresponds to an identifiable segment of the economy. Moreover, closely related firms clump together in each lobe, and in the zero-centered simplex, data points located near the origin predominantly correspond to stocks of conglomerates or diversified companies (e.g., GE, Walt Disney, 3M, etc.) and small cap companies. The number of lobes denotes how many distinct sectors are exhibited by the data.
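The normalization described in this paragraph (log returns, per-stock centering and scaling by volatility, then removal of the overall market return) can be sketched as follows. This is an illustrative sketch only: the function name `normalized_returns` and the synthetic price paths are assumptions, and real inputs would be adjusted closing prices.

```python
import numpy as np

def normalized_returns(prices):
    """Return R_ts from adjusted closing prices of shape (T + 1, S)."""
    r = np.diff(np.log(prices), axis=0)       # r_ts = log P_ts - log P_(t-1)s
    sigma = r.std(axis=0)                     # per-stock volatility, sqrt(<r^2>_t - <r>_t^2)
    R_prime = (r - r.mean(axis=0)) / sigma    # zero mean and unit s.d. for each stock
    # Remove the overall market return: subtract the average over stocks on each day.
    return R_prime - R_prime.mean(axis=1, keepdims=True)

# Synthetic, strictly positive price paths (hypothetical data for illustration only).
rng = np.random.default_rng(1)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal((251, 5)), axis=0))
R = normalized_returns(prices)
print(R.shape)
```

After the market return is removed, the returns of all stocks on any given day average to zero, which is what places conglomerates and diversified companies near the origin of the zero-centered simplex.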

[0039] As described above, FIG. 1 shows a projection of the stock price returns data space. Stock price returns decomposed according to the analysis described here are projected onto a plane spanned by the two stiffest eigenvectors of the singular value decomposition (SVD). Each colored circle corresponds to a stock in the dataset used in the analysis. Colors represent the eight emergent sectors identified in the exemplary dataset of 705 US companies used. The grey corners of the simplex correspond to sector-defining prototype stocks, whereas all other circles are given by a suitably weighted sum of these grey corners. This and additional projections are shown in FIG. 10 with axes labeled. The listed sector information includes Basic 402, Capital 404, Cyclical 406, Energy 408, Financial 410, Health 412, Non-cyclical 414, Tech 416, Telecom 418, Services 422, Real estate 424, Retail 426, Transport 428. Black circles (e.g., 430, 432, 434, 436, 438) represent the archetypes found with the described analysis.

[0040] How many emergent sectors are there in the market? The general problem of selecting a signal-to-noise ratio cutoff or a truncation threshold in high-dimensional data does not always have a clear answer. As is the case with stock price returns, the threshold is generally sensitive to sampling, but nonetheless reasonably robust for qualitative results. The described techniques are used to apply a dimensional-reduction algorithm inspired by the simplex geometry of returns to construct a hyper-tetrahedron with vertices inside the convex hull of the dataset. The exemplary dataset used in this analysis included two decades (e.g., 1993-2013) of daily price returns from 705 US public companies, each with a mid-2013 market capitalization of $1 billion or higher, e.g., representing a broad section of the economy in a period marked by major crises. This exemplary data set has eight emergent sectors, which are named as follows, for example (the prefix c- signifies "canonical" and distinguishes these names from the listed sector names more commonly used): c-cyclical (including retail), c-energy (including oil and gas), c-industrial (including capital goods and basic materials), c-financial, c-non-cyclical (including healthcare and consumer non-cyclical goods), c-real estate, c-technology, and c-utility. Calculated participation weights for a sample of 12 firms shown in FIG. 5 show a decomposition of their stocks into the canonical sectors. Also, the exemplary price returns (shown in FIGS. 6 and 7) from these exemplary sectors show the performance of the different industrial sectors of the economy, e.g., including major events that afflicted each, such as the few described below.

[0041] Associated with each canonical sector f is a time series of returns. As expected, these series show hallmark historical events of individual sectors (FIG. 6): the dot-com bubble, the energy crisis, and the global financial crisis being the major events in the last two decades. Dot-com bubble: The build-up of the speculative bubble spanning 1997-2000 and its subsequent crash over the two years that followed are clearly seen in the returns of the tech sector. One also sees that the tech bubble was primarily contained within the tech companies' ecosystem, with only minor remnants elsewhere. Energy crisis: In the period spanning 2003-2008, the crude oil price witnessed a four-fold increase (primarily ascribed to disruptions caused by Hurricane Katrina and the Iranian nuclear crisis), and then dropped precipitously following the onset of the global recession. Energy stocks also plunged headlong. Global financial crisis: The financial crisis of 2008 affected the entirety of the market, but had particularly grave implications for the real estate and financial sectors.

[0042] These emergent time series, $E_{tf}$, are basis vectors that together with weights $W_{fs}$ describe a best-fit decomposition of R as a matrix factorization $R_{ts} = E_{tf} W_{fs}$ with the constraint $\sum_f W_{fs} = 1$. An additional convexity constraint ensures that the columns of E represent the simplex corners of the dataset: $E_{tf} = R_{ts'} C_{s'f}$, where $\sum_{s'} C_{s'f} = 1$. A matrix factorization with constraints as defined here is known as Archetypal Analysis (AA). Each of these algorithms factorizes R into a product RCW by minimizing the Frobenius matrix norm $\lVert R_{ts} - R_{ts'} C_{s'f} W_{fs} \rVert^2$ subject to the aforementioned constraints. The number n of canonical sectors f is user-specified. The resulting factorization is thus a best-fit simplex to the data with vertices $E_{tf}$ constrained to be inside the convex hull of the original data, as desired.
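The constrained factorization above can be illustrated with a small numerical sketch. The code below is not the algorithm used in this document; it is an assumed, simple alternating projected-gradient scheme chosen only because it is short. It further assumes that the entries of C and W are non-negative (as a convex combination implies), and the names `project_columns_to_simplex` and `archetypal_analysis`, the iteration count, and the synthetic data are all hypothetical.

```python
import numpy as np

def project_columns_to_simplex(A):
    """Euclidean projection of each column of A onto {x : x >= 0, sum(x) = 1}."""
    n, m = A.shape
    U = np.sort(A, axis=0)[::-1]                  # each column sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    k = np.arange(1, n + 1)[:, None]
    rho = (U - css / k > 0).sum(axis=0)           # size of the support, always >= 1
    theta = css[rho - 1, np.arange(m)] / rho
    return np.maximum(A - theta, 0.0)

def archetypal_analysis(R, n, iters=300, seed=0):
    """Approximate R (T x S) by (R @ C) @ W with simplex-constrained columns of C and W."""
    rng = np.random.default_rng(seed)
    T, S = R.shape
    C = project_columns_to_simplex(rng.random((S, n)))   # S x n, columns sum to 1
    W = project_columns_to_simplex(rng.random((n, S)))   # n x S, columns sum to 1
    G = R.T @ R
    g_norm = np.linalg.norm(G, 2)
    losses = []
    for _ in range(iters):
        E = R @ C                                        # current simplex corners
        # Gradient step on W with step 1/L, then projection back onto the simplex.
        L_w = 2 * np.linalg.norm(E.T @ E, 2) + 1e-12
        W = project_columns_to_simplex(W - 2 * E.T @ (E @ W - R) / L_w)
        # Gradient step on C with step 1/L, then projection back onto the simplex.
        L_c = 2 * g_norm * np.linalg.norm(W @ W.T, 2) + 1e-12
        C = project_columns_to_simplex(C - 2 * (G @ C @ W - G) @ W.T / L_c)
        losses.append(np.linalg.norm(R - R @ C @ W) ** 2)
    return R @ C, C, W, losses

# Synthetic example: 40 "stocks" built as convex mixtures of 3 hidden series plus noise.
rng = np.random.default_rng(42)
corners = rng.standard_normal((50, 3))
true_W = rng.dirichlet(np.ones(3) * 0.3, size=40).T
R = corners @ true_W + 0.01 * rng.standard_normal((50, 40))
E, C, W, losses = archetypal_analysis(R, n=3)
print(losses[0], losses[-1])
```

Because every column of C lies on the simplex, each column of E = RC is a convex combination of the data columns and therefore lies inside their convex hull, which is the property the paragraph above relies on.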

[0043] FIG. 6 shows an exemplary diagram depicting emergent sector time series. Annualized cumulative log price returns of the eight emergent sectors c-cyclical 602, c-energy 604, c-financial 606, c-industrial 608, c-non-cyclical 610, c-real estate 612, c-tech 614 and c-utility 616 are shown. The time series capture all important features affecting different sectors, for example: the dot-com bubble (c. 2000), and the energy and financial crises of 2008. Precise definitions are given in Equation 3, for example; other measures of sector dynamics are shown in FIG. 7. The top row shows normalized log returns (columns of E_tf) for c-cyclical 702, c-energy 704, c-financial 706, c-industrial 708, c-non-cyclical 710, c-real estate 712, c-technology 714, and c-utility 716. The middle row shows cumulative log returns (same as FIG. 6 and defined in Eqn. S2) for c-cyclical 718, c-energy 720, c-financial 722, c-industrial 724, c-non-cyclical 726, c-real estate 728, c-technology 730, and c-utility 732. The bottom row shows the unweighted price index of canonical sectors (Eqn. S4) for c-cyclical 734, c-energy 736, c-financial 738, c-industrial 740, c-non-cyclical 742, c-real estate 744, c-technology 746, and c-utility 748.

[0044] Constituent Firms in Canonical Sectors

[0045] As mentioned in the preceding section, the exemplary eight sectors emerge in analysis of the exemplary dataset. Some high-level defining features of each of these sectors are listed here. The discussion that follows is summarized in Table 1 shown above.

[0046] Firms showing strong association with what is called here the c-cyclical sector include specialty and general retail outlets; well-known names include Best Buy, Kohl's, Target, Tiffany, etc. The canonical sector c-energy firms are either integrated oil and gas firms (e.g., Exxon), or are involved in operations (e.g., Hess), or provide services within this sector (e.g., Halliburton). The c-financial sector firms include large and small banks and all kinds of insurance companies, with the notable exception of health insurance firms. Bank of America, Citigroup, Wells Fargo, etc. strongly associate with this emergent sector. The c-industrial goods sector firms are often involved in specialized large-scale manufacturing of basic materials (paper products, chemicals, etc.) or capital goods (machinery); as an example, Dow Chemical Company is strongly linked to this sector. The c-non-cyclical sector comprises consumer staple goods firms (food, beverage) and also healthcare firms. Coca-Cola, Kellogg, Pfizer, Merck and many other household names are all members of this group. The c-real estate sector is almost exclusively linked to firms with heavy real estate operations, including real estate investment trusts, insurers, etc. The c-tech sector is primarily composed of semiconductor, hardware, software and communication equipment manufacturing firms such as Cisco, Intel, Oracle, Motorola, etc. Core c-utility firms are in the electric or gas supply business; examples include Duke Energy Corp., Edison International, etc.

[0047] Sector Decomposition of All Firms

[0048] Each stock return is a weighted combination of returns from the emergent sectors. In matrix form this is written as $R_{ts} = E_{tf} W_{fs}$, where matrices R, E and W contain (normalized log) returns at times t for stocks s, returns of the emergent sectors, and participation weights, respectively. The latter are constrained so that for each stock, the participation weights in multiple sectors add to unity. Calculations and further details are described below. The important insights in FIG. 5 are discussed here.
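To make the matrix form concrete, the sketch below builds a small product of sector returns and participation weights and reads off each stock's largest weight as its primary sector. The sector names are those listed in this document; the random sector series, the random weights, and the sizes are assumptions made purely for illustration and carry no financial meaning.

```python
import numpy as np

sectors = ["c-cyclical", "c-energy", "c-financial", "c-industrial",
           "c-non-cyclical", "c-real estate", "c-technology", "c-utility"]

rng = np.random.default_rng(7)
T, S, F = 250, 4, len(sectors)             # hypothetical: 250 days, 4 stocks, 8 sectors
E = rng.standard_normal((T, F))            # column f holds the return series of sector f
W = rng.dirichlet(np.ones(F), size=S).T    # F x S participation weights, columns sum to 1

R = E @ W                                  # each stock is a weighted sum of sector returns

# The primary sector association of a stock is the sector with the largest weight.
primary = [sectors[f] for f in W.argmax(axis=0)]
for s in range(S):
    print(f"stock {s}: primary sector {primary[s]}, weights sum to {W[:, s].sum():.3f}")
```

Plotting the columns of W as stacked bars would give a picture analogous to the decomposition diagram of FIG. 5.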

[0049] FIG. 5 shows an exemplary diagram depicting a canonical sector decomposition of stocks of exemplary selected companies. A complete set of pictures for all 705 stocks is provided on the companion website [www.lassp.cornell.edu/sethna/Finance]. The color scheme shown on the right is used in figures throughout the patent document, except where noted.

[0050] Conglomerates decompose into their core constituents. For example, calculations show that General Electric's returns are comprised of four segments: c-financial, c-non-cyclical, c-tech and c-cyclical, while 3M is in the business of c-industrial and c-non-cyclical. Technology companies such as Apple that sell mass-market consumer goods also have an important fraction in the c-retail sector in addition to c-tech, whereas IBM, having significant government contracts and healthcare analytics products, has a significant portion of c-non-cyclical returns. Telecom companies, for example AT&T and Verizon, are generally classified under a separate major category of their own by many classification systems, yet the present analysis shows their returns are described by a combination of c-non-cyclical and c-utility components. Returns of health insurance providers such as Aetna, United Healthcare, etc., which are commonly classified as financial services firms, are comprised of a major part c-non-cyclical and a minor part c-financial. Defense contractors like Lockheed, Northrop Grumman and Raytheon, which are primarily listed as capital goods companies, have their returns comprised of a majority c-non-cyclical component and only a smaller share of the c-cyclical sector.

[0051] Evolution of Sector Weights

[0052] The sector decomposition of firms is by no means static. As companies grow, their business foci often change. They may enter or leave different sectors through mergers, acquisitions, spin-offs, new products or target customers. The decomposition analysis described above was used with one-year overlapping windows of time to get insight into the evolving nature of sector participation of firms.

[0053] Major events affecting companies in an idiosyncratic manner show a clear signature in this analysis. For example, Corning Inc., not traditionally a tech firm, suffered in the aftermath of the dot-com crisis due to its reliance upon developing products and infrastructure for other tech firms. As such, the company has drastically shifted toward tech since the bubble burst.

[0054] FIG. 8 shows an exemplary diagram of evolving sector participation weights 800. Results from the sector decomposition made with rolling two-year Gaussian windows are shown for selected stocks. For example, a complete set of the exemplary 705 pictures is provided on the companion website [www.lassp.cornell.edu/sethna/Finance]. Color scheme is as in FIG. 5.

[0055] Likewise, a growing company's strategy shift is also seen in the analysis. For example, in the early 1990s, Berry Petroleum grew within its home state of California through development on properties that were purchased in the earlier part of the 20th century. In 2003, the company embarked on a transformation by direct acquisition of light oil and natural gas production facilities outside California. FIG. 8 shows a clear shift in the distribution of sector weights as the company has moved more squarely toward c-energy and away from c-real estate. Similarly, as Plum Creek Timber converted to a real estate investment trust (REIT) in the late 1990s, its sector weights have also significantly shifted toward the c-real estate sector as shown.

[0056] Lastly, for stable and focused companies such as Pacific Gas & Electric or IBM (FIG. 8), one sees no significant shifts in sector weights. Wal-Mart returns, on the other hand, have moved significantly from c-cyclical to c-non-cyclical (consumer staples) in the years after the financial crisis, as shown. This is also true of other low-price consumer commodities retailers such as Costco, but not true of higher-price retailers such as Whole Foods, Macy's, etc.

[0057] Exemplary Discussion of the Exemplary Data and Results

[0058] The exemplary dataset analyzed comprised daily returns over a 20-year period for 705 US companies with $1 billion or more in market capitalization. While only a small subset of businesses are publicly traded and even fewer have market caps as high as a billion dollars, the exemplary dataset nonetheless represents an excellent segment of the US economy by including a broad diversity of firms and the conditions they witnessed in the previous two decades, including at least three major domestic crises and their aftershocks.

[0059] As shown in the exemplary data, the space of stock price returns has a hyper-tetrahedral structure. This structure is inherent in the data and has emerged out of a multitude of microscopic interactions (trades) between a plethora of participants. The simplex is not only a low-dimensional manifold representation of this high-dimensional data, but it also has a meaningful sub-structure: each cell of the simplex is populated by stocks of companies in related businesses, and each corner of the hyper-tetrahedron represents "pure types" of companies that are strongly associated with one individual sector. Stocks populating the center of the tetrahedron are conglomerates or diversified companies.

[0060] It was also identified that, for example, the emergent structure is amenable to a matrix factorization (archetypal analysis) that identifies the simplex corners as emergent sector returns and decomposes each stock time series as a weighted sum of returns from the emergent sectors. This decomposition yielded new high-level insights about the nature of stock returns and their quantifiable participation across sectors, in addition to granular insights about specific firms, revealing their exposure to returns from different sectors of the economy.

[0061] For example, the exemplary implementations provided vivid insight into the evolving character of the sector participation of firms across different windows of time in the last two decades. As firms evolve and become exposed to different industrial sectors, this information is represented in their stock price returns, which will show greater correlations with those industrial sectors. Therefore, any sector index should account for the dynamic nature of constituent firms and rebalance the portfolio allocation accordingly.

[0062] The disclosed technology is also capable of addressing survivorship bias, effects of sampling at different frequencies, and incorporating smaller market cap firms. The framework of understanding stock returns via an emergent structure of their data space also suggests development of a generative model. It is noted, for example, that investors and governments alike would benefit from the development of new investable sector indices that measure the health of the industrial sectors in a more principled manner as propounded in this document.

[0063] Exemplary Dataset Particulars

[0064] A wealth of financial data is freely available online via multiple sources. For the exemplary analysis described in this document, names, tickers, listed-sectors and market caps of US-based publicly traded companies were obtained from Scottrade. The following criteria were applied to company selection:

• July 2013 market capitalization over $1 billion.

• Registered domicile in US or Caribbean countries.

• Listed for trading on NASDAQ, NYSE, or NYSE MKT (formerly AMEX).

• Continuously traded for 20 years (beginning mid-1993).

[0065] The search filters yielded a list of N = 705 tickers for which adjusted daily closing prices were obtained from Yahoo! Finance using their API; the rare cases of missing or corrupted data points in the time series were replaced with linearly interpolated values. A brief summary of listed sectors and the number of companies in each is provided in Table 2, and a full list of company names, tickers, market caps and listed-sector info is available on the companion website [www.lassp.cornell.edu/sethna/Finance].
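The replacement of missing data points by linearly interpolated values can be sketched as follows. This is a minimal example under stated assumptions: the function name `fill_gaps_linear` is hypothetical, missing points are assumed to be marked as NaN, and the short price series is invented for illustration.

```python
import numpy as np

def fill_gaps_linear(prices):
    """Replace NaN entries of a 1-D price series by linear interpolation.

    np.interp holds the nearest valid value constant at the two ends, so a
    gap at the very start or end of the series is filled by that value.
    """
    prices = np.asarray(prices, dtype=float)
    idx = np.arange(prices.size)
    good = ~np.isnan(prices)
    return np.interp(idx, idx[good], prices[good])

series = [10.0, np.nan, 12.0, np.nan, np.nan, 15.0]   # made-up closing prices
filled = fill_gaps_linear(series)
```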

[0066] Returns Factorization and Sector Decomposition

[0067] The general problem of matrix factorization has received considerable attention in recent years and a variety of factorization algorithms have been developed with the goals of dimensional reduction, classification or clustering. Examples include archetypal analysis (AA), heteroscedastic matrix factorization, binary matrix factorization, K-means clustering, simplex volume maximization, independent component analysis, non-negative matrix factorization (NMF) and its variants such as the semi- and convex-NMF, convex hull NMF and hierarchical convex NMF, among others. Each method has a unique interpretation and therefore, a successful application of any of these methods is contingent upon the underlying structure of the data. The exact definition of "returns" is given in the next section.

Table 2

Listed sector               Companies
Basic materials             58
Capital goods               61
Consumer cyclical           41
Consumer non-cyclical       40
Energy                      42
Financial (+Real estate)    138
Healthcare                  53
Services (+Retail)          101
Technology                  93
Telecom                     6
Utility                     57
Transport                   15
TOTAL                       705

[0068] Table 2 shows an example of listed sectors and the number of companies in the dataset analyzed. A full list of company names, tickers, market caps and listed-sector information is available on the companion website. Tickers for each company were obtained from Scottrade. Daily closing prices, adjusted for stock splits and dividend issues, were obtained from Yahoo Finance. The rare cases of missing prices in the time series were replaced with linearly interpolated values.

[0069] The hyper-tetrahedral structure of log price returns seen in the exemplary analysis motivates a decomposition in which each stock's return is a weighted mixture of canonical sectors:

R_ts = E_tf W_fs (1)

Columns of E_tf = R_ts C_sf are the emergent sector time series (basis vectors) representing the n corners of the hyper-tetrahedron, and W_fs are the participation weights (W_fs ≥ 0) in sector f, so that ∑_f W_fs = 1 for each stock s. The sector matrix E_tf is within the convex hull (C_sf ≥ 0, ∑_s C_sf = 1) of the data R_ts. The algorithm reduces dimensionality by representing each sample (e.g., here, each stock) as a convex combination of extremes (called archetypes). The archetypes are the columns in the basis matrix E_tf and these can be found in multiple ways:

• Minimizing the squared error with convex constraints in factorization, as originally proposed.

• Making a convex hull of the dataset and choosing one or more of its vertices to be basis vectors, but this method would have serious computational limitations in high-dimensional data.

• Making a convex hull in low dimensions and choosing one or more of its vertices to be basis vectors.

• Minimizing after initializing with candidate archetypes that are guaranteed to lie in the minimal convex set of the data.

• Fitting the smallest possible hyper-tetrahedron on the dataset.

In all but the last case above, for example, the archetypes are themselves chosen from the data:

E_tf = R_ts' C_s'f, such that ∑_s' C_s'f = 1 (2)

The rows of the C matrix are shown in FIG. 15: Canonical Sector Constituents (shown as columns of C_sf). C_sf represents a weighted combination of stocks that defines the canonical sector, each of which has a time series represented by E_tf that is given by E_tf = R_ts C_sf. The eight subplots show the constituent participation component of stocks in each canonical sector f. Canonical sectors are labeled on the plot; their names were chosen according to the listed sectors of firms that comprise them. Noteworthy features include the co-association of listed sectors: basic, capital, transport and part of cyclicals into industrial goods. Similarly, healthcare and non-cyclicals are coupled together in what we call non-cyclicals. Canonical retail goes primarily with listed retail and cyclicals. Stocks are colored by listed sectors as shown at the bottom. Listed sector information was obtained from (1). Y-axis range is from 0 to 0.05.

[0070] In sum, AA is defined as a factorization with these properties:
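Collecting the constraints already stated with Eq. 1 and with the convex-hull condition on the sector matrix, the properties can be summarized as below. This is a restatement assembled from the surrounding text, offered as a sketch rather than as the original display; the approximation sign reflects that the factorization is obtained by minimizing squared error.

```latex
\begin{aligned}
R_{ts} &\approx \sum_{f} E_{tf}\, W_{fs}, & W_{fs} &\ge 0, & \sum_{f} W_{fs} &= 1 \ \text{for each stock } s, \\
E_{tf} &= \sum_{s'} R_{ts'}\, C_{s'f}, & C_{s'f} &\ge 0, & \sum_{s'} C_{s'f} &= 1 \ \text{for each sector } f.
\end{aligned}
```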

[0071] Exemplary Calculations and Convergence

[0072] Numerical computations were performed using an in-house Python language implementation of the principal convex hull analysis (PCHA) algorithm. For the full dataset, the factorization R = EW, with E = RC as defined in Eq. 2, converged in 35 iterations to a predefined tolerance value of ΔSSE < 10⁻⁷, where ΔSSE is the average difference in sum of square error per matrix element in R - EW from one iteration to the next. The resulting columns of E_tf are shown in FIG. S4 (top row). Annualized cumulative log returns are obtained by summing in rows of E_tf. The time series Q_f(τ) are shown in FIG. 6 and the middle row of FIG. 7. Weights W_fs for selected stocks are shown in FIG. 5; the remainder are available on the companion website [www.lassp.cornell.edu/sethna/Finance]. In each canonical sector f, the components of weights for companies are shown in FIGS. 11 and 12. FIG. S5 shows an exemplary weight distribution in canonical sectors 300. Each of the eight subplots shows the constituent participation weights of all 705 companies in a canonical sector (rows of W_fs). Stocks are colored by listed sectors as shown at the bottom. Listed sector information includes Basic 1202, Capital 1204, Cyclical 1206, Energy 1208, Financial 1210, Health 1212, Non-cyclical 1214, Tech 1216, Telecom 1218, Services 1222, Real estate 1224, Retail 1226, Transport 1228. Y-axis range is from 0 to 1.
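A factorization of the form R = EW with E = RC and the stated convexity constraints can be sketched with alternating projected-gradient steps. This is not the in-house PCHA implementation referred to above; it is a simplified stand-in, and the function names, step sizes, iteration cap and synthetic demo data are all assumptions made for illustration. The stopping rule mirrors the per-element change in squared error described in the text.

```python
import numpy as np

def project_columns_to_simplex(M):
    """Euclidean projection of every column of M onto {x >= 0, sum(x) = 1}."""
    n, m = M.shape
    U = -np.sort(-M, axis=0)                       # columns sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    k = np.arange(1, n + 1)[:, None]
    cond = (U - css / k) > 0
    rho = n - 1 - np.argmax(cond[::-1], axis=0)    # last row index where cond holds
    theta = css[rho, np.arange(m)] / (rho + 1)
    return np.maximum(M - theta, 0.0)

def archetypal_analysis(R, n_arch, n_iter=200, tol=1e-7, seed=0):
    """Fit R ~ (R C) W with columns of C and W on the simplex (sketch only)."""
    rng = np.random.default_rng(seed)
    T, S = R.shape
    C = project_columns_to_simplex(rng.random((S, n_arch)))
    W = project_columns_to_simplex(rng.random((n_arch, S)))
    history, prev = [], np.inf
    for _ in range(n_iter):
        E = R @ C
        # Projected-gradient step in W; step size 1/L with L = ||E||_2^2.
        L_w = np.linalg.norm(E, 2) ** 2 + 1e-12
        W = project_columns_to_simplex(W - E.T @ (E @ W - R) / L_w)
        # Projected-gradient step in C; L bounded by (||R||_2 ||W||_2)^2.
        L_c = (np.linalg.norm(R, 2) * np.linalg.norm(W, 2)) ** 2 + 1e-12
        grad_C = R.T @ (R @ C @ W - R) @ W.T
        C = project_columns_to_simplex(C - grad_C / L_c)
        sse = np.sum((R - R @ C @ W) ** 2) / R.size   # per-element squared error
        history.append(sse)
        if abs(prev - sse) < tol:
            break
        prev = sse
    return R @ C, W, C, history

# Synthetic demo: 30 "stocks" mixed from 3 hidden "sectors" plus small noise.
rng = np.random.default_rng(1)
E_true = rng.standard_normal((50, 3))
W_true = project_columns_to_simplex(3.0 * rng.random((3, 30)))
R_demo = E_true @ W_true + 0.01 * rng.standard_normal((50, 30))
E_fit, W_fit, C_fit, history = archetypal_analysis(R_demo, n_arch=3)
```

Because each sub-step is a projected-gradient step with a step size no larger than the inverse Lipschitz constant, the recorded error does not increase from one iteration to the next, which makes a difference-based tolerance a natural stopping rule.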

[0073] The analysis of evolving sector weights was performed in a similar fashion, although with the following differences. Each column (time series) of the returns matrix R_ts was multiplied with a Gaussian, $G_\mu(t) = \exp\left(-\frac{(t-\mu)^2}{2 \times 250^2}\right)$, of standard deviation 250 centered at μ to obtain $R^{\mu}_{ts}$. With $C_{s'f}$ found using the full dataset as in Eq. 2, $R^{\mu}_{ts}$ is factorized to obtain new weights $W^{\mu}_{fs}$ that describe the sector decomposition of stocks in the period focused at t = μ: $R^{\mu}_{ts} = R^{\mu}_{ts'} C_{s'f} W^{\mu}_{fs}$. μ is increased in steps of 50 starting at μ = 0 and ending at μ = 5000, and $W^{\mu}_{fs}$ is calculated at each μ with the corresponding $R^{\mu}_{ts}$. These results are plotted in FIG. 4 for a select group of companies, and the remainder are available on the companion website [www.lassp.cornell.edu/sethna/Finance].
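The windowing step can be sketched as below, assuming time is indexed by trading day starting at 0 and that the window is applied identically to every stock. The function names are hypothetical. Refitting the weights at each center, with C held fixed, would reuse whatever constrained solver is used for the full factorization and is not shown.

```python
import numpy as np

def gaussian_window(n_times, mu, sigma=250.0):
    """Gaussian weight for each time index, centered at mu."""
    t = np.arange(n_times)
    return np.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2))

def windowed_returns(R, mu, sigma=250.0):
    """Multiply each column (time series) of R by the window centered at mu."""
    return R * gaussian_window(R.shape[0], mu, sigma)[:, None]

centers = np.arange(0, 5001, 50)          # mu = 0, 50, ..., 5000 as in the text
demo = windowed_returns(np.ones((5001, 3)), mu=1000)
```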

[0074] Dimensionality of Space of Price Returns

[0075] The stock price returns have a dimension given by the number of returns in the dataset. For the dataset used for the analysis described in this paper, 20 years of returns amount to a dimensionality of ~5001 (as there are about 250 trading days per year). It is often the case with large datasets that the effective dimensionality of the data space is much lower when one filters out the noise. A number of dimensional reduction methods exist; the singular value decomposition (SVD) (cf. principal component analysis), which is a deterministic matrix factorization, in particular yields an excellent separation of signal and noise from a dataset such as the one used in this analysis. This is discussed in more detail after introducing the following variable names and definitions.

[0076] Let matrices $P_{ts}$ and $p_{ts}$ represent prices and log prices, respectively, of stocks s at times t. Log returns are then given by $r_{ts} = p_{(t+\delta)s} - p_{ts}$, where δ is the interval length over which returns are calculated. Define another matrix $R_{ts}$ of normalized log returns with zero mean and unit standard deviation, $R_{ts} = (r_{ts} - \langle r_{ts} \rangle_t) / \sigma_s$, where $\sigma_s^2 = \langle r_{ts}^2 \rangle_t - \langle r_{ts} \rangle_t^2$ is the variance (squared volatility) of log returns.
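The definitions above translate directly into a few array operations. The sketch below assumes prices are arranged with time along rows and stocks along columns; the function name and the synthetic price paths are illustrative only.

```python
import numpy as np

def normalized_log_returns(prices, delta=1):
    """Log returns over an interval delta, scaled per stock to zero mean and unit std."""
    logp = np.log(np.asarray(prices, dtype=float))
    r = logp[delta:] - logp[:-delta]           # r_ts = p_(t+delta)s - p_ts
    # r.std uses the population form sqrt(<r^2> - <r>^2), matching the text.
    return (r - r.mean(axis=0)) / r.std(axis=0)

rng = np.random.default_rng(0)
demo_prices = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal((300, 5)), axis=0))
R_norm = normalized_log_returns(demo_prices)
```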

[0077] FIG. 13 shows an exemplary plot of normalized distribution of singular values 1300. Filled blue histogram 1302 corresponds to distribution of singular values of returns from the dataset R_ts— one notices a clear separation of the hump-shaped bulk of singular values ascribed to random Gaussian noise, and about 20 stiff singular values (the largest singular value ~952, corresponding to the market mode is not shown). Pink line histogram outline 1304 shows the distribution of singular values of a matrix of the same shape as R but containing purely random Gaussian entries.

[0078] An SVD of R_ts is a matrix factorization $R_{ts} = U_{tf} \Sigma_{ff'} V^{T}_{f's}$ such that matrices U and V are orthonormal and Σ is a diagonal matrix of "singular values". The n entries of Σ above a chosen noise threshold are retained and the rest truncated so that 0 < f, f' ≤ n, effectively reducing the dimension of R to n. The choice of n is informed by the distribution of singular values. The rows of $V^{T}$ are precisely the eigenvectors of the stock-stock returns correlation matrix, $\xi_{ss'} \sim R^{T}_{st} R_{ts'}$. Some components of the stiff eigenvectors of this stock-stock correlation matrix loosely correspond to firms belonging to the same conventionally identified business sector. See FIG. 2A.

[0079] After normalizing the log returns, the returns matrix R has entries of unit variance. If the entries were uncorrelated random variables drawn from a standard normal distribution, their singular values (which are also the positive square roots of the eigenvalues of $R^{T}R$) would be described by Wishart statistics. The Wishart ensemble for a matrix of size α × β predicts a distribution of singular values with a characteristic shape, bounded for large matrices by $\sqrt{\alpha} \pm \sqrt{\beta}$. Comparing the stock correlations with Wishart statistics has been previously used to filter noise from financial datasets. As shown in FIG. 13, most singular values of the returns matrix R are associated with the random noise, whereas only ~20 fall outside that cutoff. (The singular value bounds of a random Gaussian rectangular matrix of size α × β can be shown to be $\sqrt{\alpha} \pm \sqrt{\beta}$ for large matrices.) The largest singular value of R_ts corresponds to what we will refer to as the "market mode", as this represents the overall simultaneous rise and fall of stocks. In the analysis presented in this paper, this mode has been filtered from the returns matrix by projecting the R matrix into the subspace spanned by all non-market-mode eigenvectors. This is nearly equivalent to filtering the market mode using simple linear regression, although more convenient. The distribution of singular values of a rectangular matrix $A_{xy}$ with purely Gaussian random entries (alternatively, of the eigenvalues of a square matrix $A^{T}A$) has a well-characterized shape and has been previously used to filter noise from financial datasets.
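The comparison against the random-matrix bound and the removal of the market mode can be sketched as follows. The function name and the synthetic data (unit-variance noise plus one strong common factor) are assumptions for illustration; subtracting the leading rank-one term is used here because it equals projecting onto the subspace spanned by the remaining singular vectors.

```python
import numpy as np

def stiff_modes_and_market_filter(R):
    """Count singular values above the Gaussian bound and drop the largest mode."""
    T, S = R.shape
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    upper = np.sqrt(T) + np.sqrt(S)       # large-matrix bound for unit-variance noise
    n_stiff = int(np.sum(s > upper))      # includes the market mode if it is present
    R_filtered = R - s[0] * np.outer(U[:, 0], Vt[0])
    return s, upper, n_stiff, R_filtered

rng = np.random.default_rng(0)
noise = rng.standard_normal((400, 50))
market = rng.standard_normal(400)
R_demo = noise + 2.0 * market[:, None]    # every "stock" shares one common factor
s, upper, n_stiff, R_filtered = stiff_modes_and_market_filter(R_demo)
```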

[0080] Low-Dimensional Projections of Price Returns

[0081] A key discovery of the described technology is that the high-dimensional space of stock returns has an emergent low-dimensional, hyper-tetrahedral (simplex) structure. This structure can be clearly seen by projecting the dataset into stiff "eigenplanes".

[0082] Eigenplanes are formed by pairs of right singular vectors from an SVD. The described technology can be used to construct an SVD of the simplex corners, $E_{tf} = X_{tk} (Y Z^{T})_{kf}$, and map the simplex corners to columns of $Y Z^{T}$ because $(Y Z^{T})_{kf} = X^{T}_{kt} E_{tf}$ (in other words, $X^{T}_{kt}$ is a projection operator). The plots in FIG. 3 are the projections of the dataset, $X^{T}_{kt} R_{ts} = v_{ks}$. The rows of v taken in pairs form the axes of the projections in FIGS. 1 and 3. With those plots, it becomes clear that the eigenplanes represent projections of simplex-like data into two dimensions. Secondly, we note that the simplex structure becomes less clear as one looks at planes corresponding to smaller singular value directions; the signal eventually becomes buried in the noise.

[0083] Similarly, the results of the factorization can be seen in eigenplanes from the SVD of $E_{tf} W_{fs} = L_{tk} (M N^{T})_{ks}$. These results (rows of $(M N^{T})_{ks}$) are shown in FIG. 4, where we notice that the data now resides perfectly within the simplex region, as expected due to the constraints.
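The projection into eigenplanes can be illustrated with a small sketch in which the projector is taken from an SVD of the sector matrix. The function name and the synthetic E and W are assumptions; the demo uses exactly factorized data so that every projected stock is a convex combination of the projected corners.

```python
import numpy as np

def project_to_eigenplanes(E, R):
    """Project simplex corners (columns of E) and data (columns of R) with X^T."""
    X, y, Zt = np.linalg.svd(E, full_matrices=False)   # E = X diag(y) Zt
    corners = X.T @ E        # one column per sector; equals diag(y) @ Zt
    v = X.T @ R              # one column per stock; pairs of rows form eigenplanes
    return corners, v

rng = np.random.default_rng(0)
E_demo = rng.standard_normal((40, 3))
W_demo = rng.dirichlet(np.ones(3), size=25).T          # columns on the simplex
R_demo = E_demo @ W_demo                               # noise-free factorized data
corners, v = project_to_eigenplanes(E_demo, R_demo)
```

Plotting `v[0]` against `v[1]` would give one eigenplane, with the columns of `corners` marking the projected archetypes.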

[0084] Proportion of Variation Explained (PVE)

[0085] The goodness of the returns decomposition R = EW is assessed by the proportion of variation explained (PVE), or the coefficient of determination (r²), as follows:

r² = PVE = 1 - SSE/SST (4)

[0086] Here, SSE denotes the sum of square errors $\|R - EW\|^2$, and SST is the total sum of squares $\|R\|^2$. For the full dataset factorized according to Eq. 2, normalized with the market mode removed, the calculated r² value or PVE = 11.1%. To put this number in context for the returns dataset, one must separate the variation in R ascribable to signal from that ascribable to Gaussian fluctuations. The SVD of R with singular values shown in FIG. 13 provides a convenient way of doing so, as follows. Only 20 singular values (excluding the market mode) were above the cut-off predicted by random matrix theory for a matrix of purely random Gaussian entries. For any matrix M with elements $m_{ij}$, the norm $\|M\|_F^2 = \sum_{ij} m_{ij}^2 = \sum_i s_i^2$, where $s_i$ are the singular values. Thus, the fraction of intrinsic variation in R not attributable to noise is the sum of squares of the 20 singular values (not including the market mode) divided by SST, $\sum_{i=1}^{20} s_i^2 / \|R\|^2 = 19.8\%$. Therefore, as a first approximation, the factorization explains 11.1/19.8 = 56% of the total explainable variation. The percentage of the total explainable variation for different numbers of factors compared to the 3-factor decomposition of Fama and French is shown in Table 3. Table 3 shows the percentage of the explainable variance captured by the disclosed model compared with the Fama and French factor model. Regression is done on the normalized dataset of 705 stocks without the market mode removed. To capture this, the market mode is added to the factors obtained by the described decomposition.
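The PVE of Eq. 4 and the explainable fraction derived from singular values can be computed as in the sketch below. The function names and the toy matrices are assumptions; the final line only reproduces the arithmetic quoted in the text.

```python
import numpy as np

def pve(R, E, W):
    """Proportion of variation explained: 1 - SSE/SST with SST = ||R||^2."""
    sse = np.sum((R - E @ W) ** 2)
    sst = np.sum(R ** 2)
    return 1.0 - sse / sst

def explainable_fraction(R, n_stiff, skip=0):
    """Share of ||R||_F^2 carried by n_stiff singular values after skipping `skip`.

    Use skip=0 when the market mode has already been removed from R.
    """
    s = np.linalg.svd(R, compute_uv=False)
    return np.sum(s[skip:skip + n_stiff] ** 2) / np.sum(s ** 2)

rng = np.random.default_rng(0)
E_demo = rng.standard_normal((20, 3))
W_demo = rng.dirichlet(np.ones(3), size=8).T
R_demo = E_demo @ W_demo                  # exact rank-3 toy data

ratio = 11.1 / 19.8                       # figures quoted in the text, about 0.56
```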

[0087] It is also noted for completeness that if R is rank-reduced to the eight stiffest components found by SVD (not including the market mode), then the factorization of Eq. 2 explains 85% of the total variation in R, with overall results in good accord with the analysis presented here. This implies that sector decomposition information was already contained in the stiff modes from the SVD of R; however, SVD is not the appropriate tool for the decomposition.

[0088] Determining the Number n of Canonical Sectors

[0089] It is an open problem to determine the effective dimensionality (optimal rank) of a general dataset (matrix). One could select among models of different dimensions using statistical tests such as the PVE discussed above, or information-theory-based criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), but the choice of the selection criterion is itself generally made on an ad hoc basis. Therefore, the most direct observation of results is also the most reliable. In the dataset used for the analysis described here, a factorization with n > 8 yielded results where both the emergent time series E_tf and the weights W_fs showed qualitative signs of overfitting. The high-level results of factorization with different values of n are discussed below.

• n = 9: Results are in good agreement with n = 8, except that one resulting sector was comprised of participation from only 10 seemingly unrelated stocks (r² = PVE = 12.4%).

• n = 7: Results are similar to n = 8, except c-real estate and c-financial merge into one canonical sector (r² = PVE = 10.7%).

• n = 6: Results are similar to n = 7, except listed retail companies divide into c-cyclical and c-non-cyclical sectors (r² = PVE = 9.9%).

• n = 5: Results are similar to n = 6, except c-cyclical and c-non-cyclical merge into one canonical sector (r² = PVE = 8.7%).

• n = 4: Results are similar to n = 5, except c-energy and c-utility merge into one canonical sector (r² = PVE = 6.9%).

• n = 3: All sectors overlap and there is no clear separation of companies (r² = PVE = 5.2%).

[0090] In general, a factorization analysis of the returns dataset would be sensitive to the following factors and care must be taken in order to interpret results:

• Number of stocks in the dataset.

• Criteria applied for picking stocks.

• Period over which historical prices are obtained.

• Frequency at which returns are computed.

[0091] A robust macroeconomic analysis would therefore require a large number of stocks chosen without sampling bias, with returns calculated over the period of interest and sensitivity checked for the frequency of returns calculation. (In general, the number of time points should exceed the number of stocks.) On the other hand, an equity fund manager faces a less daunting task for an analysis that is limited to the universe of her portfolio of stocks: either to find its canonical sectors, or to analyze the exposure of her holdings to the core sectors of the economy.

[0092] Canonical Sector Indices

[0093] The matrix C_sf in the decomposition in Eq. 2 represents how returns R of stocks s must be combined to make canonical sector returns E_tf = R_ts C_sf. Since a canonical sector is defined as a combination of stocks, an investment in the sector f can be made by buying a basket of constituent stocks s in proportions given by C_sf or through an index I_tf, where p are stock prices suitably weighted by market cap or other divisor, as is common practice for common indices. An unweighted index of this kind is shown in the bottom row of FIG. 7 for results corresponding to the analysis described in this paper. Conversely, a predefined basket of stocks such as the S&P500 can be unbundled to find its exposure to the canonical sectors. With an investment strategy employing longs and shorts at the same time in correct proportions, it is conceivable to invest in, for example, the c-tech component of the S&P500.
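The basket and unbundling ideas can be sketched as below. The index form used here, a C-weighted sum of prices, is an assumption consistent with the description of an unweighted index and is not a reproduction of the source's Eq. 5; the function names and toy numbers are likewise hypothetical.

```python
import numpy as np

def sector_returns(R, C):
    """Canonical sector returns E_tf = sum over s of R_ts * C_sf."""
    return R @ C

def unweighted_sector_index(P, C):
    """Assumed index form: I_tf = sum over s of p_ts * C_sf (no cap weighting)."""
    return P @ C

def basket_exposure(W, basket):
    """Sector exposure of a basket with stock proportions `basket`: W_fs * b_s."""
    return W @ np.asarray(basket, dtype=float)

rng = np.random.default_rng(0)
C_demo = rng.dirichlet(np.ones(4), size=2).T     # 4 stocks, 2 sectors, columns sum to 1
W_demo = rng.dirichlet(np.ones(2), size=4).T     # 2 sectors, 4 stocks, columns sum to 1
P_demo = np.full((5, 4), 100.0)                  # every stock priced at 100

index = unweighted_sector_index(P_demo, C_demo)
exposure = basket_exposure(W_demo, [0.25, 0.25, 0.25, 0.25])
```

Since the basket proportions and each stock's weights both add to unity, the resulting sector exposures also add to unity.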

[0094] The desirable features of an index include completeness, objectivity and investability. The c-indices constructed using the ideas outlined here would not only be of value to investors through investment vehicles such as ETFs, Futures, etc., but also serve as important macroeconomic indicators.

[0095] FIG. 9 shows an example of projections along eigenplanes of the normalized log price returns. Each colored circle represents a stock in the exemplary dataset and is colored according to the scheme in FIG. 6 based on the primary sector association found after the calculations described in this paper.

[0096] FIG. 10 shows an example of projections along eigenplanes of the normalized log price returns. Each colored circle represents a stock in the exemplary dataset and is colored according to the scheme in FIG. 6 based on the primary sector association found after the calculations described in this paper.

[0097] FIG. 4 shows an example of projections along eigenplanes of the factorized returns. Each colored circle represents a stock in the exemplary dataset and is colored according to the scheme in FIG. 6 based on the primary sector association found after the calculations described in this paper. Black circles represent the archetypes found with the exemplary analysis.

[0098] FIG. 7 shows an exemplary diagram of canonical sector time series. Top row: normalized log returns (columns of E_tf); middle row: cumulative log returns (same as FIG. 5, as defined in Equation 3); and bottom row: unweighted price index of canonical sectors (Eq. 5).

[0099] FIGS. 11 and 12 show exemplary diagrams of weight distribution in canonical sectors. Each of the eight subplots shows the constituent participation weights of all 705 companies in a canonical sector (rows of W_fs). Stocks are colored by listed sectors as shown at the bottom. Listed sector information was obtained from www.scottrade.com.

[0100] FIG. 2A shows exemplary diagrams of singular vectors $V^{T}_{fs}$ of the SVD of returns R_ts. The orthonormal right singular vectors (rows of $V^{T}_{fs}$) of the SVD of R_ts are equivalent to the eigenvectors of the stock-stock correlation matrix $\xi_{ss'} \sim R^{T}R$. Eight of these stiffest eigenvectors, including the market mode, are shown in rows of two at a time. Each has 705 components corresponding to stocks in the dataset. The market mode, with all components in the same direction, describes overall fluctuations in the market; it was excluded from the analysis described in the paper. Previous work has suggested that each eigenvector of the stock-stock correlation matrix describes a listed sector; however, as seen above, a more correct interpretation is that each eigenvector is a mixture of listed sectors with opposite signs in components. For example, the stiffest direction (after the market mode) has positive components in real estate and utility, but negative components in tech. Less stiff eigenvectors (including the last one shown here) do not contain sector-relevant information. Stocks are colored by listed sectors as shown at the bottom. Listed sector information was obtained from.

[0101] FIGS. 13 and 14 show exemplary diagrams of canonical sector constituents (e.g., shown as columns of C_sf). C_sf represents a weighted combination of stocks that defines the canonical sector, each of which has a time series represented by E_tf that is given by E_tf = R_ts C_sf. The eight subplots show the constituent participation component of stocks in each canonical sector f. Canonical sectors are labeled on the plot and include Basic 1402, Capital 1404, Cyclical 1406, Energy 1408, Financial 1410, Health 1412, Non-cyclical 1414, Tech 1416, Telecom 1418, Utility 1420, Service 1422, Real estate 1424, Retail 1426, Transport 1428. Their names were chosen according to the listed sectors of firms that comprise them. Noteworthy features seen above include the co-association of listed sectors: basic, capital, transport and part of cyclicals into industrial goods. Similarly, healthcare and non-cyclicals are coupled together in what is called non-cyclicals. Canonical retail goes primarily with listed retail and cyclicals. Stocks are colored by listed sectors as shown at the bottom.

[0102] The examples of implementations of the disclosed technology that provide unsupervised decomposition of the world economy into canonical sectors are both unexpected and likely fruitful. How does it compare with previous, supervised (hand-tuned) methods for decomposing stock prices into independent factors? The described 8 factor decomposition explains 11.1% of the total variation (r²) in the normalized returns with the market mode removed and 56% of the explainable variation as determined by random matrix theory. For comparison, the standard 3-factor decomposition of Fama and French (market mode, market capitalization, and growth/value) yields an r² value of 4.75%; we do substantially better at explaining the variation of individual stocks. When the disclosed method is used to create a three-factor decomposition, it yields an r² value of 5.61%; there appears to be no correspondence between the disclosed factors and the small-cap/large-cap distinction used as a factor by Fama and French (FIG. 16).

[0103] FIG. 16 shows an exemplary comparison of a 3 Factor Model vs. Fama and French 2D projections of the weights for each company in the S&P500 with current tickers and data in the date range considered. Red 1602 denotes companies with large market caps (market cap > 10 billion), blue 1604 denotes medium (market cap 2-10 billion) and green 1606 denotes small (market cap < 2 billion). For our decomposition (a), there is no separation distinguishable by size of company. In comparison, for the Fama and French decomposition (b), there appears a gradation from large to small companies consistent with a factor of the model being related to size. (This is natural, since one of Fama and French's factors explicitly is the difference between large and small-cap returns.) Thus our unsupervised 3-factor decomposition appears quite distinct from Fama and French's hand-created one.

[0104] Fama and French's factor analysis is usually used not for individual stocks, but to evaluate portfolios. Carrying out a regression on the S&P500 yields an r² value of 99.4% for Fama and French compared to 93.5% for the disclosed 8-factor decomposition with the market mode reintroduced; here Fama and French do substantially better. The described decomposition was optimized without concern for market capitalization, which appears to be the key difference. For an Equal Weighted Index of the 338 stocks in the S&P500 with current tickers and a complete data series in the disclosed time of interest, a spectacular r² value of 99.0% (97.0% for 3 factors) is obtained compared to a value of 95.8% for Fama and French. Thus the disclosed unsupervised learning method generates a factor decomposition that not only reveals the underlying structure of the disclosed economic system, but provides a competitive description of portfolio returns and a superior description of the returns of individual stocks.
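A regression of a portfolio's returns on a set of factor time series, and the resulting r², can be sketched as below. This is a minimal example assuming an ordinary least-squares fit without intercept and a total sum of squares taken about zero, in line with the normalized returns used elsewhere in this document; the function name and synthetic data are hypothetical, and no Fama and French data are used.

```python
import numpy as np

def regression_r2(y, factors):
    """r^2 of a least-squares fit of y on the factor columns (no intercept)."""
    beta, *_ = np.linalg.lstsq(factors, y, rcond=None)
    resid = y - factors @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum(y ** 2)

rng = np.random.default_rng(0)
factors_demo = rng.standard_normal((100, 3))            # three made-up factor series
portfolio_exact = factors_demo @ np.array([1.0, 2.0, 3.0])
portfolio_noisy = portfolio_exact + rng.standard_normal(100)

r2_exact = regression_r2(portfolio_exact, factors_demo)
r2_noisy = regression_r2(portfolio_noisy, factors_demo)
```

For an equal-weighted portfolio, `y` would be the row-wise mean of the stock returns matrix.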

[0105] Determining the correct number of canonical sectors that appropriately describe the space of stock market returns is akin to the more general issue of selecting a signal-to-noise ratio cutoff, or a truncation threshold in the dimensional reduction of data. The choice of this threshold is generally sensitive to sampling, yet the results presented here are reasonably robust, with different choices leading to meaningful and similar decompositions. In addition to the full data set of 20 years x 705 firms, the algorithm was also applied to overlapping, two-year Gaussian windows to study how the sector weights for firms have evolved in time (see FIG. 8). As expected, the sector decomposition of firms is dynamic. Mergers, acquisitions, spin-offs, new products, the effects of competitive environments, or shifting consumer preferences can change the business foci of firms and hence alter the sector association of firms. External events affecting companies in an idiosyncratic manner also show a clear signature in this analysis. Future work remains to address survivorship bias, the effects of sampling at different frequencies, and the incorporation of smaller market cap firms. The framework of understanding stock returns via an emergent structure of their data space also suggests development of a generative model. Lastly, investors and governments alike would benefit from the development of new investable sector indices that measure the health of the disclosed industrial sectors, just as macroeconomic indicators (GDP, housing starts, unemployment rate, etc.) measure the health of the broader economy.
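
One possible way to realize the overlapping Gaussian windows mentioned above is to weight each trading day by a Gaussian centered on the date of interest and to compute weighted statistics of the returns within that window. The sketch below is illustrative only. It assumes the window width is expressed in trading days as the Gaussian standard deviation (the disclosure does not state how the two-year width is parameterized), the function names are hypothetical, and the sector decomposition that would be applied to each window is not shown.

```python
import numpy as np

def gaussian_window_weights(num_days, center, width):
    """Normalized Gaussian weights over day indices 0..num_days-1.

    Assumption: `width` is the standard deviation of the window in days.
    """
    days = np.arange(num_days)
    weights = np.exp(-0.5 * ((days - center) / width) ** 2)
    return weights / weights.sum()

def windowed_correlation(returns, center, width):
    """Gaussian-weighted correlation matrix of `returns` (T, N) around `center`."""
    weights = gaussian_window_weights(returns.shape[0], center, width)
    mean = weights @ returns                      # weighted mean per asset
    deviation = returns - mean
    cov = (deviation * weights[:, None]).T @ deviation
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)
```

Sliding `center` across the data set in steps smaller than `width` yields overlapping windows, so that changes in the weighted statistics can be followed over time.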

[0106] The described technology can be extended to address survivorship bias, the effects of sampling at different frequencies, and the incorporation of smaller market cap firms. The framework of understanding stock returns via an emergent structure of their data space also suggests development of a generative model. Lastly, investors and governments alike would benefit from the development of new investable sector indices that measure the health of the disclosed industrial sectors, just as macroeconomic indicators (GDP, housing starts, unemployment rate, etc.) measure the health of the broader economy.

[0107] Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0108] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

[0109] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

[0110] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0111] While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0112] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

[0113] Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.

Claims

What is claimed are techniques and structures as described and shown, including:

1. A computer implemented method for classifying a financial asset in a financial market into sectors based on price returns of the stocks, the method comprising:

identifying sectors in the financial market; and

creating a weighted decomposition of the identified sectors for each financial asset in a group of financial assets by assigning weights denoting an extent to which each financial asset's price return is comprised of price returns of the identified sectors.

2. The computer implemented method of claim 1, wherein creating the weighted decomposition for each financial asset is performed using an unsupervised machine learning analysis of historical time series of each financial asset's price return.

3. The computer implemented method of claim 1, wherein creating the weighted decomposition for each financial asset includes:

taking a log price return of each financial asset;

removing an overall market return from a result of taking the log price return of each financial asset; and

normalizing a result of the removing the overall market return to zero mean and unit standard deviation.

4. The computer implemented method of claim 1, wherein results of the normalizing represent price returns for the group that are well-approximated by a self-organized simplex structure.

5. The computer implemented method of claim 4, wherein the simplex structure is a hyper-tetrahedron.

6. The computer implemented method of claim 5, wherein each lobe of the hyper-tetrahedron is populated by financial assets of similar or related businesses.

7. The computer implemented method of claim 5, wherein a lobe-corner approximates the returns of financial assets that are prototypical of individual sectors.

8. The computer implemented method of claim 1, wherein the sectors comprise canonical examples generated using an unsupervised, computer implemented algorithm from historical price return data to represent each financial asset as a weighted convex linear combination of the sectors.

9. The computer implemented method of claim 8, comprising:

creating a canonical index for a given canonical sector by selecting financial assets from the group based on the weighted decomposition of the financial assets.

10. The computer implemented method of claim 1, wherein the financial assets include stocks.

11. The computer implemented method of claim 1, wherein the financial assets include bonds, currencies, or commodities.

12. The computer implemented method of claim 1, comprising:

obtaining annualized cumulative price returns for the identified sectors; and

generating a time series using the obtained annualized cumulative price returns.

13. The computer implemented method of claim 12, comprising:

identifying features affecting different sectors based on the generated time series.

14. A computer implemented method for creating industrial sector financial indices in a financial market, the method comprising:

obtaining a price return for each financial asset in a group as a weighted combination of price returns from multiple sectors;

projecting the obtained price returns of the group of financial assets onto a price return space to generate a simplex structure representation, wherein lobe corners of the simplex structure representation correspond to the multiple sectors; and

creating an index of the financial assets in the group based on a desired mix of the sectors.

15. The computer implemented method of claim 14, wherein obtaining includes:

obtaining a log price return for each financial asset in the group; and

normalizing the obtained log price return for each financial asset by a price return of the financial market.

16. The computer implemented method of claim 14, wherein the lobe corners are populated by prototypical financial assets.

17. The computer implemented method of claim 14, wherein the sectors comprise canonical examples generated using an unsupervised, computer implemented algorithm from historical price return data to represent each financial asset as a weighted convex linear combination of the sectors.

18. The computer implemented method of claim 14, wherein the financial assets include stocks.

19. The computer implemented method of claim 14, wherein the financial assets include bonds, currencies, or commodities.