Developing a data science strategy within an organization of any size can be a major challenge. Executing on a strategy can be even more difficult, especially in larger organizations. However, if done well, a properly developed data science capability and executed strategy can materially impact an organization's ability to compete in terms of both growth and process optimization. In this post, I will present an approach to developing a data science strategy. But before diving into strategy development, I will spend time defining data science and the fundamental capabilities and work processes an organization needs to do it effectively.

Wikipedia defines data science as "a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data." Note that this definition does not focus on data set size, technology, or specific tools or types of algorithms. Instead, it emphasizes using science and structured processes to generate knowledge.

Now that a definition of data science has been established, we can discuss the fundamental building blocks needed to build a data science capability and apply it effectively. There are three building blocks that any organization can invest in to construct a data science capability. An organization can be strategic in developing its data science capability by making conditional investments in these fundamentals based on competitive context, operating constraints, and time horizon.

As the definition implies, data science skills are drawn from a variety of fields, including mathematics and computer science, as well as domain knowledge of a particular industry or business. Larger data science teams will also require skills in management and leadership to support scaling.

Technology enables the collection, storage, and organization of data as well as data analysis, machine learning, and dissemination and integration of information. As mentioned earlier, depending on context and constraints, the technology could be as simple as spreadsheet software or as complex as a big data, real-time cloud computing environment.

Of course, no data science capability can exist without meaningful data. It is critical to acquire data containing information that sufficiently characterizes the dynamics and structure of the marketplace and its actors, including customers, employees, and the processes they execute to do business and make decisions.

High-level data science goals fall into two categories: creating and acquiring strategic knowledge, and moving an organization from its current state toward an automated, adaptive decisioning state.

**Figure 1:** Organizational Decisioning State Space

Using these building blocks, a data science capability can be constructed by thinking about current strengths and weaknesses within an organization and how they map onto the space shown in Figure 1. The figure shows how investments in technology and skills enable use cases of increasing complexity. For example, small investments in both skills and technology are required to create a descriptive capability, which enables business intelligence and reporting. Significant investments in technology and skills enable robust automation of complex business processes. Data investments must precede the investments in skills and technology represented in Figure 1.

An effective data science strategy will transition an organization through this space by efficiently and iteratively investing in the fundamentals and then applying the data science capability to use cases that move it towards the prescriptive sub-space, evolving the organization into one with an automated, adaptive decisioning engine.

**Figure 2:** Business or industry knowledge universe is represented by the circle. The colored areas represent acquired knowledge. The most common knowledge is closest to the center of the circle, while scarcer, newer knowledge is closer to the edge.

Another goal that an organization should strive to achieve is the strategic acquisition and generation of knowledge. Strategic knowledge enables the effective and efficient application of data science to business problems that produce meaningful differentiation in the market.

Matt Might's blog article and corresponding set of figures do a terrific job of explaining this type of goal through the lens of earning a Ph.D. Figure 2 is taken from his article.

**Figure 3:** Data Science Development Lifecycle

Now that we have high level goals and a framework for guiding our investment in fundamentals to evolve an organization into an automated, adaptive decision making engine, we can turn our attention to a generalized work process that can be used for execution in any organization.

Before moving into a detailed explanation of the process, it is worth mentioning that successful execution generally requires a partnership with a senior stakeholder who is accountable for capturing value from the process outcome.

The data and discovery phase of the lifecycle focuses on three objectives:

- Defining a problem to be solved or a question to be answered
- Defining and understanding the business process
- Acquiring and understanding the data that represents the business process

Once these objectives have been completed, a stakeholder check-in should be performed. During this check-in, the results of the data discovery and business process review should be presented. Using these results, a collective decision should be made to either stop the project or continue to the next phase.

In the modeling and analysis phase, the problem or question is solved or answered using historical data and a rigorous data analysis. This almost always produces three key outputs:

- A statistical model
- A prototype that uses the statistical model to capitalize on the answer to the question or solution to the problem
- An estimate of the prototype's value

If the findings presented in the prior phase are valuable enough to warrant a change in operations, a formal experiment or pilot is conducted to ensure the results obtained using the historical data are obtainable in the present. This is a crucial step in the lifecycle, because it is where the project is first exposed to change management. It is during this phase that a process is completely automated or people start changing what they do day to day, albeit in a controlled fashion. If appropriate steps have been taken early on in the project, then the change management required to execute the experiment should be expected by the organization.

Finally, if the results of the pilot or experiment are meaningful enough to warrant a large scale change, then the business process adjustment as well as the prototype and statistical model created and tested, are moved into production and integrated with necessary business applications. System and model monitoring are put into place to track data and system performance through time. Business process monitoring is also enhanced to ensure that metrics and decisions are recorded and tracked through time.

Thus far we have discussed how to develop a data science capability by investing in the fundamental building blocks of skills, technology, and data. We have also reviewed how to systematically apply a data science capability to a business problem to obtain meaningful value. Now, we are ready to review an approach for identifying problems of meaningful value to the organization, integrating solutions into the business, and tracking value through time.

Choosing where to focus and apply resources can be done by following these steps:

- Define the organization's revenue, product, and customer lifecycles
- Classify each stage in each of the lifecycles as either a current or potential source of strategic differentiation or cost savings.
- At each stage, assess the quality of the leadership and the state of the technology, process, and data with respect to the goals.

To ensure that deployment can be done at scale, a limited set of patterns should be defined. These patterns should cover both decision-support and pure-automation solutions. Where integration is required, systems should be assessed to ensure they can support the chosen deployment and integration strategy.

**Figure 4:** Ray Dalio's representation of continuous improvement to productivity.

Tracking the impact of data science in an organization is straightforward, if the development lifecycle is followed rigorously. Since each data science project begins by specifying a problem or question that can be measured and connected to an outcome within a business process, capturing the value of process changes becomes trivial. Following a successful project, the data science system should be enhanced continuously through time, using the original metric as a measure of impact.

When planned for and implemented properly, data science projects will produce meaningful value for an organization. In particular, when either building or applying a data science capability and strategy, successful implementation will:

- Rely on a structured process focused on knowledge discovery and hypothesis testing to enable long term success or rapid, controlled failure.
- Invest early in skills and data. Both are more resilient than technology.
- Ensure that a collaborative and cross-functional approach is applied to solving and answering all business problems and questions.

The old adage about successful investing goes: buy low, sell high. But today, with the stock market at all-time highs and interest rates near historical lows, it seems challenging, if not impossible, to follow this sage advice. Therefore, I thought it would be informative to see what history can tell us about market returns when investing at all-time highs, holding for increasing periods of time, and then selling.

Using S&P 500 daily observations from 1950 through August of 2019, I analyzed index returns that would have been earned by investing at all-time market highs and then selling 1, 5, 10, and 20 years later. The results show that the 10- and 20-year investment durations always yield positive returns, while the 1- and 5-year durations yield losses 26.4% and 10.3% of the time, respectively. Furthermore, the magnitude of losses varies widely, with about 40% being greater than 10%. When compared to investing in the S&P 500 when it is less than 95% of its historical all-time high value, we find that 1- and 5-year holding periods yield loss probabilities that are about 4 to 5 percentage points lower (21.4% and 6.0%). However, the probability of a loss at least 10% in magnitude increases by nearly 29%.

Overall, these results show that buying and holding over extended periods of time, at least roughly 10 years, would have avoided incurring investment losses under any starting conditions and support the idea that market timing was not a material factor for long-term investors during this time period.

For shorter time horizons, buying at times when the index was off its all-time highs led to a roughly 20% reduction in loss risk.

According to Wikipedia, "the S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices, and many consider it to be one of the best representations of the U.S. stock market. The average annual total return of the index, including dividends, since inception in 1926 is 9.8%; however, there were several years where the index declined over 30%. The index has posted annual increases 70% of the time."

For this study, I collected 17,520 daily returns, excluding dividends, from the S&P 500, beginning on January 3, 1950 and ending on August 19, 2019, from Yahoo Finance. The data set, $D$, can be found here.

Each observation, $d \in D$, contains the following data points, summarizing intra-day value dynamics.

Date | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
Aug 05, 2019 | 2,898.07 | 2,898.07 | 2,822.12 | 2,844.74 | 2,844.74 | 4,513,730,000 |

We identify historical all-time highs by sorting observations in time ascending order such that $d_i$ is the observation occurring the day before $d_{i+1}$. Then, for each $d_i$, we compare its adjusted close value, $c_i$, to the maximum adjusted close value of all days prior to $d_i$, $c_{max}$.

If $c_i > c_{max}$, we classify $d_i$ as an all-time high and compute the returns associated with holding a single share for 1, 5, 10, and 20 years. The return, $r_t$ for $t \in (1,5,10,20)$, is computed as the ratio of the adjusted close value $t$ years in the future to the current all-time high, $c_{i + 365t} / c_{i}$. Because $r_t$ is a ratio, $r_t > 1$ indicates a positive return and $r_t < 1$ indicates a negative return.
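The procedure above can be sketched in a few lines of Python. The helper below is a hypothetical illustration: it indexes forward by a fixed number of observations rather than matching calendar dates, so the `horizon` argument only approximates the $c_{i+365t}$ offset used in the post.

```python
import numpy as np

def all_time_high_returns(close, horizon):
    """Flag all-time highs in a series of closing values and return the
    forward return ratio close[i + horizon] / close[i] for each high."""
    close = np.asarray(close, dtype=float)
    returns = []
    running_max = -np.inf  # maximum close over all prior days
    for i, c in enumerate(close):
        # a close above every prior close is an all-time high
        if c > running_max and i + horizon < len(close):
            returns.append(close[i + horizon] / c)
        running_max = max(running_max, c)
    return returns
```

On a monotonically rising series every ratio exceeds 1; a ratio below 1 marks a holding period that ended in a loss.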

The figure below overlays the occurrences of the S&P's prior 1,242 all-time market highs with the daily adjusted closing values of the index.

For each of the all-time high observations, the returns for the investment periods described above are computed. The figure below shows the cumulative distribution function (CDF) of the return values for each. A critical value characterizing risk for an investor is the probability of loss. The CDF shows that the 1- and 5-year durations have loss probabilities of 0.264 and 0.103, respectively. The longer durations, 10 and 20 years, experience no losses. In general, the historical data support the buy-and-hold rule of thumb used by many investors. That is, on a historical basis, investing in the index at all-time highs would have yielded no losses if the assets were held for 10 or 20 years.

For durations of 1 and 5 years, it is worth investigating the magnitudes of losses. The figure below examines the left tails of the CDFs plotted above to highlight the likelihood of extreme losses for each holding period.

It is also informative to understand how buying at all-time highs compares to buying at other times. An identical analysis is conducted on the 9,073 data points where $c_i \le 0.95 \cdot c_{max}$. The threshold 0.95 is chosen arbitrarily, and one could search over the entire space if so inclined. The results, shown in the charts below, are structurally similar to what is observed in the all-time high data. However, they differ in two ways. First, in years 1 and 5, the probability of a loss decreases from 0.264 to 0.214 and from 0.103 to 0.06, respectively. This makes sense intuitively: by avoiding all-time highs, an investor is also avoiding the chance of being forced to buy and sell at extreme highs and lows.

However, the probability of experiencing a loss of at least 10% is higher over both durations. The 1 year duration has a probability of loss greater than 10% equal to 0.131 while the 5 year is 0.024. The table below provides this information for a direct comparison.

Duration | All-Time High, at least 10% loss | Less than 95% of High, at least 10% loss | All-Time High, at least 20% loss | Less than 95% of High, at least 20% loss |
---|---|---|---|---|
1 | 0.114 | 0.131 | 0.031 | 0.070 |
5 | 0.031 | 0.024 | 0.013 | 0.015 |

These results show that buying and holding over extended periods of time, at least roughly 10 years, would have avoided incurring investment losses under any starting conditions, all-time highs or otherwise, and support the idea that market timing was not a material factor for long-term investors during this time period.

For shorter time horizons, buying at times when the index was off its all-time highs led to a roughly 20% reduction in loss risk.

What can over five million monthly stock returns tell us about modern portfolio theory (MPT) and the capital asset pricing model (CAPM)? Mainly, that stocks do not have the properties necessary to support the conclusions of these theories. In this article, I present a cursory overview of MPT and the CAPM. Then I perform a simple statistical analysis that clearly shows stock returns do not completely satisfy the assumptions of MPT and CAPM.

Modern Portfolio Theory states that a portfolio of securities will produce an expected return \[ E(R_p) = \sum_i w_iE(R_i), \] where $w_i$ is the fraction of the portfolio invested in security $i$ producing returns $R_i$. It also states that the portfolio will produce a volatility \[ \sigma_p^2 = \sum_i \sum_j w_i w_j \sigma_i \sigma_j p_{ij}, \] where $w_{i}$ is again the fraction of the portfolio invested in security $i$, $\sigma_{i}$ is the standard deviation of its returns $R_i$, and $p_{ij}$ is the correlation coefficient between the returns of securities $i$ and $j$.

From this theory, we can conclude that a portfolio with a set of securities that are not perfectly correlated with one another will produce returns with volatility that is less than the volatility of the individual returns. While this is a powerful framework, it relies on the underlying return generating processes to produce distributions with finite means and variances.
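As a quick numerical check of the volatility formula, the sketch below uses hypothetical weights, volatilities, and a correlation of 0.3 between two assets:

```python
import numpy as np

# two assets, equal weights, 20% volatility each, correlation 0.3 (hypothetical)
w = np.array([0.5, 0.5])
sigma = np.array([0.20, 0.20])
p = np.array([[1.0, 0.3],
              [0.3, 1.0]])  # correlation matrix

# covariance matrix: sigma_i * sigma_j * p_ij
cov = np.outer(sigma, sigma) * p
# portfolio volatility: sqrt(sum_ij w_i w_j sigma_i sigma_j p_ij)
port_vol = np.sqrt(w @ cov @ w)
print(round(port_vol, 4))  # 0.1612, below the 0.20 of either asset alone
```

Because the correlation is less than 1, the portfolio's volatility comes out below that of either individual security, exactly as the theory predicts.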

The Capital Asset Pricing Model (CAPM) is related to Modern Portfolio Theory. It states that the expected return of an asset is equal to the risk-free rate plus the expected excess return of the market over the risk-free rate, weighted by the sensitivity of the asset's returns to the overall market returns. Mathematically, this is expressed as \[ E(R_i) = R_f + \beta_i(E(R_m) - R_f), \] where $R_f$ is the risk-free rate, $\beta_i = p_{im}\frac{\sigma_i}{\sigma_m}$ is the asset's sensitivity, and $R_m$ is the market return. As in Modern Portfolio Theory, this model relies on the return rates having finite and measurable means and variances.

I collected a data set consisting of monthly stock returns from January 31, 1980 through January 31, 2017. During this 13,515-day period, there were 5,604,568 quotes from 45,872 stocks.

For each stock, I created a time series of monthly rates of return. The monthly rate of return, $R_i(t)$, for stock $i$, is computed as \[ R_i(t) = \frac{\Delta P_i}{P_i(t-1)}, \] where $\Delta P_i$ is the change in stock price for stock $i$ from month $t-1$ to $t$ and $P_i(t-1)$ is the previous month's stock price. The figure below plots a histogram of monthly stock returns.
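Using hypothetical month-end prices, the return series can be computed in one vectorized step:

```python
import numpy as np

# hypothetical month-end prices for a single stock
prices = np.array([100.0, 104.0, 98.8, 101.764])

# R(t) = (P(t) - P(t-1)) / P(t-1)
monthly_returns = np.diff(prices) / prices[:-1]
print(np.round(monthly_returns, 3))  # [ 0.04 -0.05  0.03]
```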

Now that we have the data in good order, we can move on to the analysis. In this section, we study the returns by comparing their structure to two parametric distributions: the Gaussian distribution and a less common one, the Cauchy distribution. The Cauchy distribution is selected because of its relationship to the Gaussian and because it does not have a mean or a variance.

To fit a Gaussian distribution to data we need to estimate its mean, $\mu$, and variance, $\sigma^2$. The log-likelihood function of the Gaussian is expressed as \[ \mathcal{L}_{gaussian} (\{x\} | \hat\mu, \hat\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat\sigma^2) - \frac{1}{2\hat\sigma^2} \sum_{i=1}^n(x_i - \hat\mu)^2. \] From this expression, the parameters can be estimated from the data using maximum likelihood. \[ \hat\mu = \frac{1}{n}\sum_i^n x_i = 0.002, \] \[ \hat \sigma^2 = \frac{1}{n}\sum_i^n (x_i - \hat \mu)^2 = 0.056. \]

The Cauchy distribution is another parametric continuous distribution with similar structure to the Gaussian. However, it has infinite mean and variance. The log-likelihood function for the distribution is expressed as \[ \mathcal{L}_{cauchy} (\{x\} | \hat x_o, \hat\gamma) = -n\log(\hat\gamma\pi) - \sum_{i=1}^n \log \Big(1 + (\frac{x_i-\hat x_o}{\hat\gamma})^2\Big). \]

While maximum likelihood methods can be used to estimate the parameters of the distribution from data, there is a simpler approach. We compute $x_o$ by taking the median of the data $\{x\}$. \[ \hat x_o = \textrm{median} \{x\} = 0.00. \] We estimate $\gamma$ by taking half the interquartile range of the data $\{x\}$, since the quartiles of a Cauchy distribution lie at $x_o \pm \gamma$. \[ \hat\gamma = \frac{1}{2}\left(CDF^{-1} (0.75) - CDF^{-1} (0.25)\right) = 0.068. \]
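Both sets of estimates are one-liners in NumPy. The sketch below substitutes a synthetic Cauchy sample with scale 0.03 for the actual return data, so the recovered $\hat\gamma$ should land near 0.03:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the ~5.6M monthly returns: Cauchy noise, scale 0.03
x = 0.03 * rng.standard_cauchy(100_000)

# Gaussian maximum-likelihood estimates
mu_hat = np.mean(x)
sigma2_hat = np.var(x)

# Cauchy quantile-based estimates: the quartiles sit at x_o +/- gamma,
# so gamma is half the interquartile range
x0_hat = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
gamma_hat = (q75 - q25) / 2.0
```

Note that `mu_hat` and `sigma2_hat` are unstable for heavy-tailed samples, which is the point of the comparison that follows.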

Now that we have estimated the most likely distributions to have generated the data, we can compare them to determine which is the better fit. Before jumping straight to the test, let's visually inspect the distributions and the data. The following figure plots the full Gaussian($\hat \mu$, $\hat \sigma^2$) and Cauchy($\hat x_o$, $\hat \gamma$) distributions on top of the histogram of empirical data. It is clear that the Gaussian distribution spreads similar probability over a wider range of returns, while the Cauchy concentrates returns in a narrower range closer to what is observed in the data.

Next, let's inspect the tails of the distributions. The two figures below show the binned data as well as the estimated probabilities of return at the left and right extremes respectively.

It is clear from the figures that the Gaussian distribution underestimates extreme events while the Cauchy distribution more closely follows the empirical observations.

Now that we have some intuition from the visual analysis, we can apply a more rigorous comparison via a likelihood ratio test. We perform the test by computing the log-likelihood for the Cauchy and Gaussian distributions using the equations noted earlier in the article. Then we take their ratio, \[ \mathcal{R} = \frac{ \mathcal{L}_{cauchy}(\{x\} | \hat x_o, \hat\gamma)} { \mathcal{L}_{gaussian} (\{x\} | \hat\mu, \hat\sigma^2)}. \] If the Cauchy distribution is a better fit to the data, we expect to observe a large and positive number. Indeed, the result, $\mathcal{R} = 23,525,726.265$, indicates that the Cauchy distribution is a better fit than the Gaussian.
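The two log-likelihoods transcribe directly from the equations earlier in the article. The sketch below applies them to a synthetic heavy-tailed sample rather than the actual return data; for such data the Cauchy log-likelihood should dominate:

```python
import numpy as np

def loglik_gaussian(x, mu, sigma2):
    # -n/2 log(2 pi) - n/2 log(sigma^2) - sum((x - mu)^2) / (2 sigma^2)
    n = len(x)
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

def loglik_cauchy(x, x0, gamma):
    # -n log(gamma pi) - sum(log(1 + ((x - x0) / gamma)^2))
    n = len(x)
    return -n * np.log(gamma * np.pi) - np.sum(np.log(1 + ((x - x0) / gamma) ** 2))

rng = np.random.default_rng(1)
x = rng.standard_cauchy(10_000)  # synthetic heavy-tailed sample

ll_g = loglik_gaussian(x, np.mean(x), np.var(x))
q25, q75 = np.percentile(x, [25, 75])
ll_c = loglik_cauchy(x, np.median(x), (q75 - q25) / 2.0)
# for heavy-tailed data the Cauchy log-likelihood is the larger of the two
```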

From this result, we can call into question the validity of MPT and CAPM. As previously mentioned, both rely on finite measures of mean and variance, and the data clearly show that a parametric distribution with infinite mean and variance fits the data better than a traditional Gaussian distribution. Moreover, we have shown that if a Gaussian distribution is used to construct a portfolio, it will systematically underestimate the extreme risk of a portfolio of domestic equities.

This cursory analysis motivates additional research into not only the statistical properties of market returns but also the generating processes that create such structure [1]. Relatively new research, both inside and outside the confines of finance, has begun to explore such approaches [2-4].

It is clear that we have much to learn and existing theory, while likely useful in specific instances, does not generalize to represent a unified and complete understanding of how market dynamics produce returns through time.

1. Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M. "Generative or Discriminative? Getting the Best of Both Worlds." Bayesian Statistics, 2007.
2. Taleb, N. N. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. Random House, 2005.
3. Farmer, J. D., Patelli, P., and Zovko, I. I. "The Predictive Power of Zero Intelligence in Financial Markets." Proceedings of the National Academy of Sciences, 2005.
4. Galla, T., and Farmer, J. D. "Complex Dynamics in Learning Complicated Games." Proceedings of the National Academy of Sciences, 2013.

The Bayesian approach to the classical Binomial test has been discussed in a few different places (see here, here, here, and here). Here, I thought I could provide a brief synopsis of the methodology and then share a JavaScript/D3 implementation that allows a user to supply a comma-separated list of observations and then receive the results of the analysis through an explanatory visualization.

Given some data that represent samples drawn from two binary processes, we would like to understand if these two processes are equivalent. These processes could represent the results of flipping coins, users clicking on web ads, or people responding to email campaigns.

Mathematically, we can model each process using a Binomial distribution, \[ Binomial(n,k) = {n \choose k} p^k (1-p)^{n-k} \] where \(n\) corresponds to the number of trials, \(k\) the number of positive observations (or successes), and \(p\) the probability of observing a success.

Being a Bayesian approach, we can then place a prior over the success probability. For simplicity, we'll choose an uninformative one. The Binomial's conjugate prior is the Beta distribution, \[ Beta(\alpha,\beta) = \frac{p^{\alpha-1}(1-p)^{\beta -1}}{B(\alpha,\beta)} \] where \(\alpha \) and \( \beta\) represent pseudo counts of prior observations. When both parameters are equal to 1, the distribution is uniform and thus uninformative.

Since the Beta distribution is the conjugate prior of the Binomial, it is simple to obtain an analytical solution for the posterior of each process: we simply add the observed counts to the pseudo counts and use them as the parameters of a new Beta distribution, \[ Beta(k+\alpha,n-k+\beta). \]

Thus far, we have only described a single process, but we are really interested in the difference between the two. To get a posterior that represents this difference, we can simply sample from each posterior, take the difference between samples, and study the structure of the resulting posterior.

To quantify uncertainty, Bayesians use credible intervals, which are practically similar (but by no means philosophically the same) to how Frequentists use confidence intervals. A 95% credible interval will contain those domain values that fall between the 0.025 and 0.975 quantiles.
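Putting the pieces together, a minimal NumPy sketch (with made-up counts for the two groups) samples both posteriors, differences them, and reads off the 95% credible interval:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical observations: k successes out of n trials per group
k_a, n_a = 45, 100
k_b, n_b = 65, 100
alpha, beta = 1, 1  # uniform Beta(1, 1) prior

# conjugate posteriors: Beta(k + alpha, n - k + beta)
post_a = rng.beta(k_a + alpha, n_a - k_a + beta, size=100_000)
post_b = rng.beta(k_b + alpha, n_b - k_b + beta, size=100_000)

# posterior of the difference between the two success probabilities
diff = post_b - post_a
lo, hi = np.quantile(diff, [0.025, 0.975])
# an interval that excludes zero suggests the processes differ
```

With these counts the interval sits entirely above zero, so we would conclude the second process has the higher success probability.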

Now that we have reviewed the foundations of the Bayesian equivalent to the Binomial test and discussed how we can quantify the uncertainty in our estimate, we can use the test. For completeness, the code is available on Github here.

To use the tool, simply enter a comma separated list of observations for each group into the appropriate text area below and then click the Run Test button to see the results. A smoothed posterior distribution will be plotted, along with the bounds for a 95% credible interval.

Collaborative filtering is a technique for building recommender systems that relies only on a user's past behavior to make recommendations about what he or she might do next. There are two primary classes of solutions to the problem: nearest neighbor methods and latent factor models. Research has shown that latent factor models outperform nearest neighbor methods, so in this post, I'll work through implementing a latent factor model known as UV-factorization. UV-factorization is a type of matrix factorization, where a single large matrix is approximated by the product of two lower rank matrices.

Mathematically, the problem begins with a large matrix, known as a utility matrix, \(\boldsymbol{M}\), that contains behavioral data about individual users and the products or services they interact with. Each row in the matrix represents a user, each column represents a product or service, and the entries record users' past interactions, often characterized as ratings or purchases.

The solution to the problem is to determine which, out of all the products each user has not interacted with, is the most likely to be preferred the next time the user chooses to perform an action, such as watching a movie or buying another pair of shoes. This is done by decomposing the utility matrix into two lower rank matrices that, when multiplied together, fill in the missing (user, item) pairs. \[ \boldsymbol{U}\boldsymbol{V} \approx \boldsymbol{M} \]

To estimate \(\boldsymbol{U}\) and \(\boldsymbol{V}\), we can define our objective function in terms of minimizing the squared error between each row of M and the inner product of the corresponding rows of \(\boldsymbol{U}\) and \(\boldsymbol{V}\). To prevent overfitting, we can also apply a regularization term. Mathematically, our objective function is defined as, \[ \sum_{(i,j)} (\boldsymbol{M}_{ij} - \boldsymbol{V}_j^T \boldsymbol{U}_i)^2 + \lambda( \|\boldsymbol{V}_j\|^2 + \|\boldsymbol{U}_i\|^2). \]

To solve this objective, we can use stochastic gradient descent, originally presented here. After taking derivatives of the objective with respect to \(\boldsymbol{U} \) and \(\boldsymbol{V} \) we are left with the following pair of update rules: \[ \boldsymbol{U}_i \leftarrow \boldsymbol{U}_i + \alpha((\boldsymbol{M}_{ij} - \boldsymbol{V}_j^T \boldsymbol{U}_i)\boldsymbol{V}_j - \lambda \boldsymbol{U}_i) \\ \boldsymbol{V}_j \leftarrow \boldsymbol{V}_j + \alpha((\boldsymbol{M}_{ij} - \boldsymbol{U}_i^T \boldsymbol{V}_j)\boldsymbol{U}_i - \lambda \boldsymbol{V}_j), \] where \(\alpha\) is the learning rate parameter and \(\lambda\) is the regularization parameter. After estimating \(\boldsymbol{U}\) and \(\boldsymbol{V}\) by iterating over the known (\(i,j\)) pairs in the data, user \(i\)'s recommendation for product \(j\) can be estimated by computing \(\boldsymbol{U_i V_j^T}\).

As we've seen in other posts, implementing stochastic gradient descent is more or less trivial, once we have derived the update rules. The code below takes as input a sparse SciPy matrix and iterates over the data. The free parameters f, lr, and reg define the width of \(\boldsymbol{U}\) and \(\boldsymbol{V}\), the learning rate, \(\alpha\), and the regularization parameter, \(\lambda\), respectively.

```python
import numpy as np

def sgd_uv(util_mtx, f=5, lr=0.001, reg=0.1, max_iter=1000):
    # get dimensions of util_mtx, which is a compressed sparse row matrix
    r, c = util_mtx.shape
    # initialize item matrix
    v = np.random.rand(c, f)
    # initialize user matrix
    u = np.random.rand(r, f)
    # track RMSE over the known entries after each iteration
    err_arr = []
    # fit the matrices with a fixed number of iterations
    for _ in range(max_iter):
        e, m = 0.0, 0
        for i in range(r):
            for j in util_mtx[i].indices:
                err = util_mtx[i, j] - np.dot(v[j], u[i])
                v[j] = v[j] + lr * (err * u[i] - reg * v[j])
                u[i] = u[i] + lr * (err * v[j] - reg * u[i])
                # accumulate squared error (computed before the update)
                e += err ** 2
                m += 1
        err_arr.append(np.sqrt(e / m))
    return u, v, err_arr
```

To measure the algorithm's convergence, we can compute the root mean squared error (RMSE) after each iteration of SGD, shown below.

```python
def rmse(util_mtx, u, v):
    e, m = 0.0, 0.0
    r, c = util_mtx.shape
    for i in range(r):
        for j in util_mtx[i].indices:
            e += (util_mtx[i, j] - np.dot(v[j], u[i])) ** 2
            m += 1
    return np.sqrt(e / m)
```

In the code below, we create a sparse matrix, initialize it with random values, and then pass it to our learning algorithm.

```python
import numpy as np
import scipy.sparse
import matplotlib.pyplot as plt

# make a sparse matrix
A = scipy.sparse.lil_matrix((100, 100))
# fill some of it with random numbers
A[0, :10] = np.random.rand(10)
A[1, 10:20] = A[0, :10]
A.setdiag(np.random.rand(100))
# convert to compressed sparse row for quick row iterations
A = A.tocsr()
# fit the model
u, v, err_arr = sgd_uv(A)
# plot the error as a function of learning algorithm iteration
plt.plot(err_arr, linewidth=2)
plt.xlabel('Iteration, i', fontsize=20)
plt.ylabel('RMSE', fontsize=20)
plt.title('Components = 5', fontsize=20)
plt.show()
```

The figure below plots the RMSE as a function of SGD iteration. The exponential decay early in the iteration history and leveling off towards the end suggests the algorithm is converging on a solution, which may be a local one.

Note that to measure the performance of the algorithm, a fraction of the (\(i,j\)) pairs should be held out of the training process and be used as a test set, which I haven't done here.

In this post I've focused on a relatively simple objective function, to simplify the presentation. In reality, a more complex objective might be required to achieve a desired level of performance. Such objectives might incorporate additional features, like age and gender, to work around the cold start problem.

I took an interest in the multi-armed bandit problem after reading John Holland's book, Adaptation in Natural and Artificial Systems. Traditionally, the problem is described as follows: a gambler is putting coins into a slot machine (the bandit) with a number of arms, each of which has an independent probability of paying out a fixed reward. The gambler's objective is to learn, or explore, as little about the slot machine's arms as necessary in order to determine which has the highest probability of returning a reward. Then, using this knowledge, the gambler can capitalize on, or exploit, the arm that yields the most rewards by neglecting to pull all other arms. This trade-off between learning and capitalizing is known as exploration and exploitation.

The problem structure is general enough that it appears in a variety of domains: website testing (changing fonts or page structure to improve click-through rates), online advertising, robotics, and dynamic pricing.

In what follows, I'll formalize a simple version of the problem, describe a modern solution, and then present an implementation and a visual simulation of the inner workings of the solution's algorithm.

In the simplest case, a bandit has \(k\) arms, each with its own success probability, or bias. We will denote these success probabilities as \(\theta_1, \theta_2, \dots, \theta_k\). When a bandit arm is pulled, it yields a reward \(y \in \{0,1\}\). Under these conditions, each bandit arm is a Bernoulli random variable, parameterized by \(\theta\), and after some number of trials, the number of successes follows a Binomial distribution. In the literature, this formulation is known as the binomial bandit.

There is a rich set of solutions to the problem, based on a variety of techniques. The one I'm focusing on in this post, presented here, is a Bayesian variant of probability matching, also known as Thompson sampling, as described by Steven Scott.

To solve the binomial multi-armed bandit problem, our goal is to learn which arm has the highest success probability. Following Scott's method, we achieve this by conducting a series of trials, where, for each trial, we probabilistically select an arm according to its estimated success probability. From a Bayesian point of view, we can estimate each arm's success probability according to the following expression: \[ \Pr(\theta | \mathbf{y}) \propto \Pr(\mathbf{y} | \theta) \Pr(\theta), \] where \(\Pr(\mathbf{y} | \theta)\) is known as the likelihood function and \(\Pr(\theta)\) is known as the prior. Because each arm is a Bernoulli random variable, the likelihood function is equal to: \[ \Pr(\mathbf{y} | \theta) = \prod_{i=1}^{N_t} \theta^{y_i}(1-\theta)^{1-y_i}, \] which simplifies to \[ \Pr(\mathbf{y} | \theta) = \theta^{x_t}(1-\theta)^{N_t-x_t}, \] where \(x_t\) is the total number of successful trials and \(N_t\) is the total number of trials conducted up to time \(t\). The prior distribution represents any knowledge known about the arms before each trial. The conjugate prior for the likelihood function we just derived is the Beta distribution, defined as: \[ \Pr(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \] where \(B(\alpha,\beta)\) is the Beta function, \(\alpha\) represents the number of previous successful trials and \(\beta\) represents the number of previous unsuccessful trials.

When the likelihood and prior functions are multiplied together to obtain the posterior, we get another Beta distribution, parameterized as follows: \[ \Pr(\theta | \mathbf{y}) = Be(\alpha+x_t, \beta+N_t-x_t) \]
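As a quick numerical check of this conjugate update, suppose an arm has produced 3 successes in 10 trials, starting from a uniform prior (the numbers here are made up for illustration):

```python
#uniform prior: alpha = beta = 1
alpha, beta = 1.0, 1.0
#suppose the arm has produced x_t = 3 successes in N_t = 10 trials
x_t, N_t = 3.0, 10.0
#the posterior is Be(alpha + x_t, beta + N_t - x_t) = Be(4, 8)
a_post, b_post = alpha + x_t, beta + N_t - x_t
#the posterior expectation is (alpha + x_t) / (alpha + beta + N_t) = 4/12
theta_hat = a_post / (a_post + b_post)
```

After only ten trials, the estimate (1/3) is already pulled toward the observed success rate (0.3) and away from the prior mean (1/2).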

Given this problem structure, Scott's solution proceeds as follows:

- Initialize the experiment by applying a uniform prior to each arm. This is equivalent to setting \(\alpha = \beta = 1\) for each arm's prior distribution.
- For each trial:
    - Choose an arm by sampling from each arm's posterior and selecting the arm with the largest sampled \(\theta\).
    - Pull the selected arm and observe the reward.
    - Update the arm's parameters by incrementing \(\alpha\) or \(\beta\) according to whether the trial produced a success or a failure.

Over the course of the experiment, we would like to measure the performance of our method. As mentioned earlier, a quantitative measurement of performance is regret - the loss incurred when a non-optimal arm is pulled during an experiment. If the truly optimal arm is known, we can compute the total regret after each trial as follows: \[ R_t = \sum_{a \in \textrm{arms}} n_a(\theta^* - \hat{\theta}_a), \] where \(n_a\) is the number of trials allocated to arm \(a\), \(\hat{\theta}_a\) is that arm's estimated success probability, and \(\theta^*\) is the optimal arm's success probability. Note that because we are representing each arm's success probability as a Beta distribution, we can compute \(\hat{\theta}\) by taking the distribution's expectation, which is equal to \[ \hat{\theta} = \frac{\alpha+x_t}{\alpha + \beta + N_t}. \] Optimal solutions to the multi-armed bandit problem, like the one presented here, exhibit a total regret that grows logarithmically with the number of trials.
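The simplified test framework shown later in this post calls a regret helper without defining it. A minimal sketch consistent with the formula above might look like the following (the function name and signature are my own; it uses the true success probabilities, which the test framework passes in, rather than the estimates):

```python
import numpy as np

def regret(best, true_arms, arm_choices):
    # total regret: for each arm, the number of pulls times the gap between
    # the optimal arm's true success probability and that arm's
    return np.sum(arm_choices * (true_arms[best] - true_arms))
```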

If we use existing statistical libraries, the solution's implementation is trivial. In what follows I'll present a Python implementation. The complete code base for this post can be found here.

To begin, we need to create a data structure to maintain the state of each arm. As we showed earlier, each arm's posterior can be represented as a Beta distribution whose parameters are equal to the number of successes and failures. We can define a class to store these counts as follows:

```python
class BetaDist(object):
    def __init__(self, alpha=1., beta=1.):
        #initialize the distribution to uniform
        self.alpha = alpha
        self.beta = beta
```

The constructor initializes the distribution such that it is uniform. Scott's algorithm states that after each trial we will need to update the distribution's counts of \(\alpha\) and \(\beta\), compute its expectation, and draw samples from it. These methods are defined as:

```python
    def update(self, x):
        #update counts according to whether the trial was a success or failure
        if x == 0.:
            self.beta += 1.
        elif x == 1.:
            self.alpha += 1.

    def get_mean(self):
        #return the expectation of the distribution
        return self.alpha / (self.alpha + self.beta)

    def sample(self, samples):
        #draw some number of samples from the posterior and return them
        return scipy.stats.beta.rvs(self.alpha, self.beta, size=samples)
```

A bandit can be represented by instantiating a BetaDist class for each arm and defining methods that return relevant values for each step of Scott's algorithm.

```python
class BernoulliBandit(object):
    def __init__(self, arms=2, samples=1):
        #initialize a number of beta distributions, one for each arm
        self.arms = [BetaDist() for i in xrange(arms)]
        self.samples = samples

    def update(self, arm, reward):
        #update the state of the relevant arm's posterior
        self.arms[arm].update(reward)

    def choose_arm(self):
        #probability matching: draw a sample from each arm's posterior
        #and choose the arm with the largest sampled value
        return np.argmax([arm.sample(samples=self.samples) for arm in self.arms])

    def best_arm(self):
        #get the current best arm by returning the index of the largest
        #expectation
        return np.argmax([arm.get_mean() for arm in self.arms])

    def get_mean(self, arm):
        #return the arm's expectation
        return self.arms[arm].get_mean()
```

To test this code, we can conduct a simple simulation where we know the values of actual arms and then measure how well the algorithm estimates which one is optimal by computing total regret. I've included this testing framework in the code repository under the file BernoulliBanditTest.py. Below is a simplified testing framework that does not include code to plot the evolution of the algorithm in terms of posterior distributions or total regret.

```python
import numpy as np
import scipy.stats
from BernoulliBandit import BernoulliBandit

trials = 100
arms = 3
regrets = []
#draw random biases for each of the arms on the true bandit
true_arms = np.random.random_sample(arms)
arm_choices = np.zeros(arms)
#initialize the true bandit's random variables using the random bias values
bandit = [scipy.stats.bernoulli(a) for a in true_arms]
#initialize our model
bb = BernoulliBandit(arms=arms)
for t in xrange(trials):
    #choose an arm probabilistically, based on what we know so far
    arm = bb.choose_arm()
    #record the choice for regret measurement
    arm_choices[arm] += 1
    #get a reward from a single trial of the arm
    reward = bandit[arm].rvs()
    #estimate the regret for this trial (regret is a helper defined in the
    #full testing framework in the repository)
    regrets.append(regret(true_arms.argmax(), true_arms, arm_choices))
    #update the model
    bb.update(arm, reward)

#print a summary of arm pulls, estimated biases and actual biases
print 'arm biases ', true_arms
print 'arm allocations ', arm_choices
print 'best estimated arm ', bb.best_arm()
print 'best actual arm ', true_arms.argmax()
```

In addition to the Python implementation I've reviewed above, I've developed a simple web-based simulator. To see the algorithm in action, click the button and watch how it evolves over 100 trials.

The top figure plots each arm's posterior distribution after each trial. Notice how at the beginning of the simulation each arm's distribution is the same, and after each trial, the chosen arm has its posterior adjusted according to whether or not the trial was successful.

The figure below the posterior plot shows the experiment's total regret as a function of trials. Notice that once the algorithm has built up a strong belief that one arm is optimal, the rate at which total regret grows decreases. The table shows the experiment's true arm success probabilities, as well as the number of trials allocated to each arm and each arm's total number of rewards.

Stochastic gradient descent is an extremely scalable learning algorithm; however, its sequential nature can become a bottleneck when processing large data sets. To work around this, researchers have developed parallelized versions of the algorithm. In this post, I'll briefly review stochastic gradient descent as it's applied to logistic regression, and then demonstrate how to implement a parallelized version in Python, based on a recent research paper.

In what follows, I'll be using notation introduced in a previous post.

Gradient descent is a simple, principled optimization algorithm used to choose parameter values in a variety of discriminative and generative predictive models. The algorithm works by efficiently searching a model's parameter space according to the following rule:

\[\begin{aligned} \theta := \theta -\alpha \nabla J(\theta). \end{aligned} \]

In the stochastic version, the parameters \(\theta\) are updated after analyzing each individual training example, as opposed to updating after a subset (mini-batch) or the entire set of (batch) training examples. I've written a previous post on batch gradient descent here.

The algorithm's scalability comes from only having to examine a single training example prior to updating. When the data are large, iterating over an entire training set, or even a small subset, can be an impassable bottleneck.

Logistic regression is a classification algorithm that fits a model to a set of features and targets according to the following relationship \[\begin{aligned} h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}. \end{aligned} \]

When this hypothesis function is inserted into our cost function, \[\begin{aligned} J(\theta) = \frac{1}{2}\sum_{i=1}^m (y_i - h_\theta(x_i))^2, \end{aligned} \] we obtain the following update rule \[\begin{aligned} \theta := \theta + \alpha \left(y_i-h_\theta(x_i)\right)x_i, \end{aligned} \] which is equal to \[\begin{aligned} \theta := \theta + \alpha \left(y_i-\frac{1}{1+e^{-\theta^Tx_i}}\right)x_i. \end{aligned} \]

Now that we have an update rule, we can implement the algorithm. The code below defines our hypothesis function, \(h\), and the stochastic gradient descent routine, \(sgd\). \(h\) takes as input a single training example and the parameter vector \(\theta\), and returns the result of the hypothesis function we defined previously. The complete source code for this post can be found here.

For input, \(sgd\) takes the training set's features, \(x\), and targets, \(y\), along with a learning rate, \(a\), and a maximum number of iterations, \(max\_iter\). The algorithm first initializes the parameters \(\theta\) to random values and then makes \(max\_iter\) passes over the training set. On each pass, the training data is randomly permuted, after which the parameters \(\theta\) are updated after observing each individual training example. For this post, I am excluding a convergence criterion and instead iterating over the data for the maximum number of iterations, as defined by the routine's input parameter.

```python
from math import e
import numpy as np

def h(x, theta):
    return 1. / (1. + e**(np.dot(x, -theta)))

def sgd(x, y, a, max_iter):
    #initialize algorithm state
    m, n = x.shape
    theta = np.random.random(n)
    z = np.arange(m)
    #make max_iter passes over the training set
    for t in xrange(max_iter):
        #shuffle indices prior to each pass
        z = np.random.permutation(z)
        #for each training example
        for i in z:
            #update parameters
            theta = theta + a*(y[i] - h(x[i], theta))*x[i]
    return theta
```

There are a variety of methods available that implement a parallelized version of stochastic gradient descent. The version we'll use here is based on a relatively new method that trains an ensemble of models in parallel, using stochastic gradient descent and distinct subsets of the training set, and then uses the average parameter values to make predictions. The paper describing this method along with its mathematical properties can be found here.

Implementing the parallelized version of stochastic gradient descent is a trivial process, thanks to Python's multiprocessing library. All that is required is to partition the training set and define a function that takes a data structure containing a subset of the training data and the stochastic gradient descent parameters.

```python
def train(input):
    x = input['x']
    y = input['y']
    a = input['learning_rate']
    iters = input['iters']
    return sgd(x, y, a, max_iter=iters)
```

Then, using a process pool from multiprocessing, each subset of training data can be passed to each worker process and the results can be averaged and used to make predictions.

The code below demonstrates this by creating a synthetic data set using scikit-learn, partitioning it into distinct subsets, training individual models in parallel, and then using the average to make predictions on the test set.

```python
from multiprocessing import Pool
from sklearn import metrics, datasets
import numpy as np

#learning rate
a = 0.001
#create a synthetic data set: default features, 1500 examples, 2 classes
x, y = datasets.make_classification(1500)
#insert a 1's column at the beginning of x
x = np.hstack((np.ones((x.shape[0], 1)), x))
#partition the data
input = [{'x': x[:250],     'y': y[:250],     'learning_rate': a, 'iters': 1000},
         {'x': x[250:500],  'y': y[250:500],  'learning_rate': a, 'iters': 1000},
         {'x': x[500:750],  'y': y[500:750],  'learning_rate': a, 'iters': 1000},
         {'x': x[750:1000], 'y': y[750:1000], 'learning_rate': a, 'iters': 1000}]
#worker pool
pool = Pool(4)
#estimate parameters for each model in parallel
thetas = pool.map(train, input)
#take the average
theta = np.mean(thetas, axis=0)
#make predictions on the test set
pred = [h(x[i], theta) for i in xrange(1000, 1500)]
#get the roc curve
fpr, tpr, thresholds = metrics.roc_curve(y[1000:], pred)
```

The figure below plots the ROC curves for parallel and serial versions of the stochastic gradient descent code, trained on the same data, indicating that they produce similar performance, just as the paper proves mathematically.

The advantage of running a parallelized version of stochastic gradient descent is its speed. To measure the improvement the parallel version produces, I've written two test routines, one for parallel stochastic gradient descent and one for the sequential version. Measuring the run time of each method shows the parallel version completing in roughly a quarter of the execution time of the sequential version. This is expected, since I partitioned the data into 4 equal-sized groups.

```python
import timeit
from multiprocessing import Pool
from sklearn import metrics, datasets
import numpy as np

def test_parallel_sgd():
    #learning rate
    a = 0.001
    #create a synthetic data set: default features, 1500 examples, 2 classes
    x, y = datasets.make_classification(1500)
    #insert a 1's column at the beginning of x
    x = np.hstack((np.ones((x.shape[0], 1)), x))
    #partition the data
    input = [{'x': x[:250],     'y': y[:250],     'learning_rate': a, 'iters': 1000},
             {'x': x[250:500],  'y': y[250:500],  'learning_rate': a, 'iters': 1000},
             {'x': x[500:750],  'y': y[500:750],  'learning_rate': a, 'iters': 1000},
             {'x': x[750:1000], 'y': y[750:1000], 'learning_rate': a, 'iters': 1000}]
    #worker pool
    pool = Pool(4)
    #estimate parameters for each model in parallel
    thetas = pool.map(train, input)

def test_serial_sgd():
    #learning rate
    a = 0.001
    #create a synthetic data set: default features, 1500 examples, 2 classes
    x, y = datasets.make_classification(1500)
    #insert a 1's column at the beginning of x
    x = np.hstack((np.ones((x.shape[0], 1)), x))
    #estimate parameters sequentially, on the same amount of training data
    thetas = sgd(x[:1000], y[:1000], a, 1000)

if __name__ == "__main__":
    t = timeit.Timer(lambda: test_parallel_sgd())
    print t.timeit(number=1)  #prints 2.25465488434
    t = timeit.Timer(lambda: test_serial_sgd())
    print t.timeit(number=1)  #prints 8.18555998802
```

In this post I've demonstrated how to implement a parallel version of stochastic gradient descent in Python and shown that it performs nearly as well as the sequential version. The complete source code for this post can be found here.

In a previous post, I presented an analysis of an occupation network and how one might use a particular measure of network structure, eigenvector centrality, to identify important occupation titles. Here, I will present a visual method for performing the same task.

An occupation network is comprised of a set of occupation titles that are connected to one another. Titles are connected when a person transitions between them, for example through a career change or job promotion. Using this definition of connectedness and data from nearly 1,000,000 publicly available resumes, I built the network by extracting and interconnecting occupation titles according to the job transitions recorded in the resumes. The network consists of roughly 300,000 vertices and 1,500,000 edges.

K-core decomposition is a method for breaking a network up into nested sets, or shells, according to degree. If we define a network as \(G = (V,E)\), where \(V\) is the set of vertices and \(E\) is the set of edges inter-connecting them, then the k-core is the maximal subgraph of \(G\) in which every vertex has degree (total number of edges) \(\ge k\) within the subgraph. The decomposition works by recursively removing vertices from the network with degree less than \(k\); the vertices that remain form the k-core, and the process is repeated for each value of \(k\), starting at \(k=1\). The figure below shows a simple example.
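The pruning procedure can be sketched in a few lines of Python. This is a minimal, unoptimized version of my own (the function name and the dict-of-sets network representation are assumptions, not the software used for the figures below); it records, for each vertex, the largest \(k\) for which the vertex survives the pruning:

```python
def core_numbers(adj):
    # adj maps each vertex to the set of its neighbors (undirected network);
    # copy it so the caller's structure is not destroyed by the pruning
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    while adj:
        k += 1
        # recursively remove vertices whose remaining degree is below k
        removed = True
        while removed:
            low = [v for v, nbrs in adj.items() if len(nbrs) < k]
            removed = bool(low)
            for v in low:
                core[v] = k - 1  # v survives pruning up to k - 1
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
    return core
```

For example, a triangle with one pendant vertex attached decomposes into a 2-core (the triangle) and a 1-shell (the pendant vertex).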

Each set of vertices is then plotted in a circular fashion, according to a geometric layout algorithm. Vertices with the largest \(k\) are placed in the middle of the figure, and vertices with decreasing \(k\) are placed progressively further from the middle, using a logarithmic scaling factor. The resulting plot allows for the identification of hierarchical structure and important vertices, according to their degree centrality. A detailed explanation of the algorithm can be found here. The software, Lanet, used to generate the visualization contained in this post can be found here.

The figure below presents a visualization (showing 1% of vertices) of the k-core decomposition of the occupation network. According to the right-hand legend, the colors correspond to the size of \(k\): purple maps to \(k = 1\) and red maps to \(k \ge 118\).

What can we learn about the network from the visualization? The most prominent feature is the hierarchical arrangement of distinct sets of colored vertices. Most occupation titles in the network have only a single edge, while a select few have 118 or more.

From a career path perspective, red-colored occupation titles offer the most diverse set of next moves and also provide a central position within the network. Such a position offers relatively short paths to any other occupation title, when compared to vertices with lower degrees. For workers unsure of where to go next in their careers, targeting these central vertices as next positions provides more immediate future opportunities than any alternative.

Which occupation titles are most central? For a list of the top 10, see this post.

These days, conceiving and executing a career path has never been more difficult, especially if you are unsure of exactly where you want to end up. For this reason I asked myself, how can millions of resumes and network science be used as tools that make this task easier and more efficient? In this post, I'll briefly present the results of some preliminary analysis that begins to shed light on the answer to this question. To get started, I'll introduce an occupation network, which I constructed from nearly 1,000,000 resumes. Then, I'll present the network's degree distribution and explain how to compute a simple, yet powerful network property called centrality. Then, I'll discuss how to interpret the measure when one is uncertain of where to start, or go next in their career. And finally, I'll present the top 10 occupation titles that offer the most diversity in future opportunities, ranked according to their centrality.

For readers only interested in results, skip to the top 10 career diversifying occupation titles.

Before we can discuss network measures, we need to define a standard way of representing them. Mathematically, a network can be represented as a matrix, denoted as \(\boldsymbol{A}\), and referred to as an adjacency matrix. Each entry in the matrix, \(A_{ij}\) indicates whether an edge connecting vertex \(i\) to vertex \(j\) exists. (The \(i\) and \(j\) in \(A_{ij}\) represent the respective row and column of the adjacency matrix.)

If all edges are considered to be equal in a network, then the value of each entry \(A_{ij}\) is typically equal to 1 if an edge exists between the two vertices and 0 otherwise.

As an example, consider the undirected, unweighted, network shown above. An adjacency matrix representation of this network would have the form: \[\begin{aligned} \boldsymbol{A} = \begin{bmatrix} 0 & 1 & 0 & 0 & 1\\ 1 & 0 & 1 & 1 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 1 & 0 & 0 & 0 & 0\\ \end{bmatrix} \textrm{ .} \end{aligned} \]

Over the course of a few months, I crawled the web and acquired nearly 1,000,000 publicly available resumes. From these resumes, I extracted the chronologically ordered sequences of occupation titles contained in the documents, among other career defining features such as education level and skills. I then normalized the job titles and constructed an occupation network where an edge is placed between two occupation titles, \(i\) and \(j\), if at least one resume contains a direct transition from occupation \(i\) to occupation \(j\). This procedure produced a network of roughly 300,000 occupation titles and just over 1,000,000 edges.

The first property I study when I analyze a network is the degree distribution. I do this for two reasons: it's easy to compute, and the result sheds light on the network's structure and the types of dynamical processes that may have created it. These results often motivate which analyses I conduct next.

Computing the degree distribution is straightforward. For each degree \(k\), which represents a vertex's number of incoming or outgoing edges, we compute the fraction of vertices that have that degree: \[\begin{aligned} \Pr(K = k) = \frac{n_k}{n} \end{aligned} \] where \(n_k\) represents the number of vertices with degree \(k\) and \(n\) represents the total number of vertices in the network. When a network is directed, the degree distribution is computed both with respect to incoming edges, referred to as in-degree, and with respect to outgoing edges, referred to as out-degree.
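As a quick illustration, the small example network shown earlier has degree sequence 2, 3, 1, 1, 1; its degree distribution can be computed with a couple of numpy calls:

```python
import numpy as np

# degree sequence of the small example network from earlier in the post
degrees = np.array([2, 3, 1, 1, 1])
n = len(degrees)
# count how many vertices have each degree k, then normalize by n
ks, counts = np.unique(degrees, return_counts=True)
pr_k = counts / float(n)
# Pr(K=1) = 3/5, Pr(K=2) = 1/5, Pr(K=3) = 1/5
```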

The figure above plots the occupation network's in-degree and out-degree distributions and indicates that the majority of nodes in the network contain a small number of incoming and outgoing edges, while a select few possess the opposite trait - thousands of incoming and outgoing edges. This heavy-tailed, or scale-free, pattern is found in many other real-world social and technological networks and is often associated with hierarchical structure and complex behavioral or dynamical processes.

How might this type of complex network structure be exploited by career-minded individuals? Those who are unsure of precisely what type of career they want to pursue, or where to go next, might be inclined to choose a position that does not constrain the diversity of their future opportunities. One way to satisfy this objective is to target occupation titles with high out-degrees. By doing so, an uncertain job seeker gains the freedom and additional time to develop a more concrete plan for the future, without limiting their options and without stopping work. So which occupation titles have high degree? Ranking nodes according to their degree is a measure of network centrality, which we'll discuss next.

Centrality measures seek to quantify the importance of a vertex using a variety of metrics. In social contexts, a vertex with a high centrality score is interpreted as being influential in the network, while a low centrality indicates the opposite. As I mentioned earlier, with respect to the occupation network, a vertex's centrality can be interpreted as how well it diversifies future career options.

We have already indirectly discussed the simplest of all centrality measures - degree centrality. To compute a vertex's degree centrality, one simply sums the number of edges leading to or from the vertex, depending on the desired direction. For an unweighted network, this is expressed mathematically, using the adjacency matrix, as \[\begin{aligned} \textrm{Node i's out-degree centrality} = d_{i_{out}} = \sum_{j} \boldsymbol{A}_{ij} \end{aligned} \] \[\begin{aligned} \textrm{Node i's in-degree centrality} = d_{i_{in}} = \sum_{j} \boldsymbol{A}_{ji} . \end{aligned} \]
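Applied to the example adjacency matrix from earlier, these sums are one line each in numpy (because that example network is undirected, its matrix is symmetric and the in- and out-degrees coincide):

```python
import numpy as np

# adjacency matrix of the small example network from earlier in the post
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]])

out_degree = A.sum(axis=1)  # row sums:    sum over j of A_ij
in_degree = A.sum(axis=0)   # column sums: sum over j of A_ji
```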

One limitation of degree centrality is that it places no weight on the centrality of neighboring vertices. For example, if a vertex has many neighbors that all have low degrees, the vertex will still obtain a high degree centrality. On the other hand, if a vertex has only a few neighbors that point to it, but those neighbors have high centralities, the vertex is still given a low centrality. This effect is often undesirable, especially with respect to occupation networks and maximizing the diversity of future career opportunities. Next, I'll present a simple metric designed to account for these properties.

Eigenvector centrality attempts to compensate for the shortcomings of degree centrality by assigning centrality values to vertices based on the centrality of their neighbors. Thus, if a vertex has only a few neighbors, but those neighbors are of high centrality, the vertex will have a higher centrality score than it would under degree centrality. To compute vertex \(i\)'s eigenvector centrality, the centralities of its neighbors are tallied as follows \[\begin{aligned} x_i = \sum_j A_{ij}x_j. \end{aligned} \] For readers wanting a more detailed explanation of how eigenvector centrality is computed, see this wikipedia page.
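The centralities solving this self-referential equation form the leading eigenvector of \(\boldsymbol{A}\), and one standard way to compute it is power iteration. The sketch below (my own, not part of the original analysis) applies it to the small example network from earlier; iterating on \(A + I\) rather than \(A\) preserves the eigenvectors while preventing the iteration from oscillating on bipartite networks like this one:

```python
import numpy as np

# adjacency matrix of the small example network from earlier in the post
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)

# power iteration on A + I: repeatedly apply the matrix and renormalize
M = A + np.eye(A.shape[0])
x = np.ones(A.shape[0])
for _ in range(200):
    x = M.dot(x)
    x = x / np.linalg.norm(x)
# x now approximates the leading eigenvector of A; vertex 1, whose
# neighbors are the best connected, receives the highest centrality
```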

Which occupation titles in the network are the most diversifying? In the table below, I've listed the top 10 occupations, ranked according to their eigenvector centrality scores. For comparison, I've also included their degree centrality scores.

rank | occupation | eigenvector centrality | in-degree centrality | out-degree centrality
---|---|---|---|---
1 | owner | 0.131 | 5519 | 5063
2 | manager | 0.127 | 5098 | 5279
3 | customer service representative | 0.126 | 5097 | 6081
4 | administrative assistant | 0.123 | 5367 | 6153
5 | sales associate | 0.115 | 4043 | 4996
6 | office manager | 0.111 | 3742 | 4056
7 | supervisor | 0.109 | 3605 | 3804
8 | assistant manager | 0.108 | 3137 | 4037
9 | customer service | 0.105 | 3149 | 3582
10 | sales | 0.101 | 2902 | 3126

Overall, this list of occupation titles is customer service, management, and sales centric, as opposed to being science or engineering focused, which isn't surprising. Science and engineering oriented occupations often require highly specialized skill sets and higher levels of education than the occupations listed above, and as a result, they have smaller numbers of opportunities than occupations that require generic skill sets.

At this point you might be asking yourself: where do these occupations lead? What skills, education, and experience do they require? How much do they pay? And perhaps more importantly, can they get me closer to achieving my long-term career goals?

Hopefully this discussion of how network science can be successfully applied to career path dynamics has piqued your interest enough to try it for yourself. I'm in the early development stages of deploying much of what has been presented here to the optimization and prediction of individual career paths in the form of a web application, www.pathop.com. Consider creating an account and see what lies ahead in your career!

With the playoffs in full swing, I thought some analysis of the scoring dynamics in basketball would interest readers, so I've put together a comparison of two methods for predicting a game's winner - one developed by a serious sports fan and the other developed by me, in collaboration with Aaron Clauset.

In 2008, Bill James published a formula for determining when a lead is safe in a basketball game, which I'm calling the Safe Lead method. Bill's calculation is as follows:

- Take the number of points one team is ahead and subtract 3.
- Add 1/2 a point if the team that is ahead has the ball, and subtract a 1/2 point if the other team has the ball. (Numbers less than zero become zero.)
- Square the result.
- If the squared result is greater than the number of seconds remaining in the game, the lead is safe.
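The steps above translate directly into a short function (the function and argument names are my own):

```python
def lead_is_safe(lead, has_ball, seconds_remaining):
    # take the number of points the team is ahead and subtract 3
    margin = lead - 3.0
    # add half a point if the leading team has the ball, else subtract half
    margin += 0.5 if has_ball else -0.5
    # numbers less than zero become zero
    margin = max(margin, 0.0)
    # the lead is safe if the squared result exceeds the seconds remaining
    return margin ** 2 > seconds_remaining
```

For example, a 10-point lead with possession and 30 seconds left is safe (\(7.5^2 = 56.25 > 30\)), while a 3-point lead without possession and 10 seconds left is not.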

Why does this formula work? And perhaps more importantly, how accurate is it? In this post, I'll shed some light on the accuracy of Bill's formula, and I'll also compare it to a method I've developed, which is based upon viewing a basketball game as a stochastic process.

At this point, readers that are only interested in the results might want to scroll to the end of this post.

For this analysis, a data set describing when baskets were scored during a game, and by which team, is required. Both the NBA and ESPN have made available high-resolution, play-by-play data for nearly every basketball game played in recent history. I've written a Python script to download and extract play-by-play data from ESPN, which can be found here.

I'm using data from the last 10 seasons of the NBA. In total, this is roughly 12,000 games and 1,200,000 scoring events (I'll define what a scoring event is shortly). For my analysis, I'm using a Python library I've created called SportsSciPy.

When the game of basketball is viewed as though it were a stochastic process, we can model the probability of the home team's score increasing by \(p\) points as the joint probability of observing a scoring event, the home team winning the event, and the event being worth \(p\) points. Mathematically, this can be expressed as \[\begin{aligned} \Pr(\Delta S_{home} = p) = \Pr(\textrm{event at } t)\Pr(\textrm{home team scores})\Pr(\textrm{points} = p). \end{aligned} \] I've used a similar model in previous research on modeling scoring dynamics in online games. In what follows, I'll analyze the data non-parametrically with respect to each of the three components introduced above: timing, scoring, and points. A paper that performs a more rigorous, parametric analysis of what I'll be presenting below can be found here.
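To make the decomposition concrete, here is a toy evaluation of the joint probability with made-up component values (the real components are estimated from the play-by-play data, as described in the rest of the post):

```python
# hypothetical component probabilities, for illustration only
pr_event = 0.033                      # chance of a scoring event this second
pr_home = 0.5                         # chance the home team wins the event
pr_points = {1: 0.3, 2: 0.6, 3: 0.1}  # distribution over point values

def pr_home_scores(p):
    # joint probability that the home team's score increases by p points
    return pr_event * pr_home * pr_points[p]
```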

To get started, we'll begin by studying the timing of scoring events. A scoring event is defined as *the total number of points scored during a second of official game play.* This means a series of foul shots scored in the same second of play are aggregated into a single event worth a point value equal to the number of successful shots made by the player. The figure below plots the probability of observing a scoring event at each second of a game.

To fans, this figure should agree well with their intuition and understanding of the game. The 4 spikes in scoring at seconds 720, 1,440, 2,160, and 2,880 are the last seconds of play in each quarter of the game. In the first quarter, the probability of scoring is nearly constant once teams warm up and find their rhythm. Scoring in the second quarter steadily increases going into half-time. In the second half, scoring occurs with relatively constant probability, with the exception of roughly the last 30 seconds of the game, where teams pick up tempo to try and secure their win or reduce their deficit.

The figure and inset below plot the distribution of inter-arrival times between consecutive scoring events, as well as their correlation. They reveal two important features of the game: first, there is roughly a 30-second gap between consecutive baskets; second, the inter-arrival times are virtually uncorrelated.
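Both quantities are straightforward to compute. Here's a sketch using synthetic event times with a 30-second mean gap (the lag-1 correlation of the gaps is what the inset reports):

```python
import numpy as np

def interarrival_stats(event_seconds):
    """Inter-arrival times between consecutive scoring events and their
    lag-1 correlation (a value near zero suggests independent timing)."""
    gaps = np.diff(np.sort(np.asarray(event_seconds)))
    if len(gaps) < 2:
        return gaps, float("nan")
    r = np.corrcoef(gaps[:-1], gaps[1:])[0, 1]
    return gaps, r

# Synthetic event times with exponentially distributed ~30 s gaps:
rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(30, size=500))
gaps, r = interarrival_stats(times)
# mean(gaps) is near 30 and r is near 0 for this memoryless process
```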

These features suggest the timing of scoring events is approximately independent, making it possible to estimate the expected number of scoring events remaining in a competition from the current second \(i\) as follows

\[\begin{aligned} \textrm{Number of remaining scoring events, } n = \sum_{t=i}^{2880} \Pr(\textrm{event at } t). \end{aligned} \]Next, we'll examine how the probability of scoring correlates with a team's lead size. The figure below plots the probability of scoring given a lead size of \(l\) points.
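Concretely, the remaining-events estimate is just a tail sum over the per-second event probabilities. A sketch, using a hypothetical flat probability rather than the empirical curve:

```python
def remaining_events(p_event, current_second, horizon=2880):
    """Expected number of scoring events left in regulation: the sum of
    Pr(event at t) from the current second through the final second.
    `p_event[t - 1]` holds Pr(event at second t)."""
    return sum(p_event[t - 1] for t in range(current_second, horizon + 1))

# Hypothetical flat event probability of 0.4 per second:
p_event = [0.4] * 2880
n = remaining_events(p_event, current_second=2761)  # final two minutes
# 120 remaining seconds at 0.4 gives about 48 expected events
```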

Perhaps one of the most interesting features of basketball is that the probability of scoring again decreases as a team's lead increases. Some have speculated that this negative correlation is due to the possession change after each successful event. However, I have found that in other sports with the same rule, such as football, the probability of scoring increases with lead. If you have an opinion or some insight that may explain this phenomenon, please share it in the comments of this post.

Finally, we'll briefly examine the distribution of points scored over all the games in our data set. As expected, the majority of scoring events are worth 2 points.

Now that we have identified the data and methods for estimating each component of our scoring dynamics model, we can create a Markov chain that describes the evolution of a game's lead. For consistency, we'll let a game's lead, \(l\), be \(> 0\) when the home team is in the lead and \(l\) be \( < 0\) when the away team is in the lead.

The Markov chain's state space consists of score differences between the home and away team. The transition probabilities are equal to the joint probability of scoring given the lead size and the value of the scoring event. These values can be stored in a matrix, denoted \(P\), where \(P_{ij}\) is the probability of transitioning from a current lead of \(i\) to a future lead of \(j\). Mathematically, the probability of transitioning from state \(i\) to state \(j\) is equal to

\[\begin{aligned} P_{ij} = \Pr(\textrm{scoring} \mid \textrm{lead} = i) \Pr(\textrm{points} = p), \end{aligned} \]where \(p = |j - i|\). Given \(P\), all that is required to compute the probability of one team winning, given the game's current state, is to estimate the number of remaining events in the game, \(n\), raise \(P\) to this power, and then sum over all states that are either greater or less than 0 (depending on the team). Mathematically this is equal to
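Here's how this could look in NumPy, with a tiny made-up transition matrix standing in for the real \(P\) (which would be estimated from the lead-conditioned scoring and point-value distributions):

```python
import numpy as np

def win_probability(P, lead, n, leads):
    """Pr(home team wins | lead, n remaining events): raise the
    transition matrix to the n-th power and sum the probability mass
    over end states with a positive (home) lead.
    `leads[i]` is the lead value labeling state i."""
    Pn = np.linalg.matrix_power(P, n)
    row = Pn[leads.index(lead)]
    return sum(row[j] for j, l in enumerate(leads) if l > 0)

# Tiny hypothetical chain over leads -2..2: a fair one-point random
# walk, absorbing at the boundaries for brevity.
leads = [-2, -1, 0, 1, 2]
P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])
p_home = win_probability(P, lead=1, n=10, leads=leads)  # > 0.5
```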

\[\begin{aligned} \Pr(\textrm{home team wins} \mid l, n) = \sum_{j=1}^m P^n_{lj}, \end{aligned} \]where \(m\) is the largest lead in the state space. To measure the Markov chain's ability to predict a game's outcome, I've computed the timing, scoring, and points distributions using 3/4 of the game data and then predicted each remaining game's outcome as a function of the cumulative number of scoring events. The figure below plots the model's prediction performance, measured by AUC, which can be interpreted as the accuracy of the classifier, as a function of cumulative events. If we assume a scoring event every 30 seconds, the classifier operates at roughly 75% accuracy after only 15 minutes of play, and continues to improve as the game progresses.
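As an aside, AUC needs no ML library: its rank (Mann-Whitney) interpretation gives a direct, if quadratic-time, computation, which is what makes the "probability a randomly chosen won game outranks a randomly chosen lost game" reading possible:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive example receives a higher score than a
    randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated predictions give AUC = 1.0; a constant score
# (no information) gives 0.5.
auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])
```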

So how does our Markov chain method of estimating a game's winner compare to Bill James' Safe Lead method? First, let's take a closer look at what Bill's method is telling us.

The Safe Lead method predicts whether or not the current leader of a game will lose their lead. This means that if Bill's method predicts a safe lead, the team in the lead will win the game. On the other hand, if Bill's method predicts the lead isn't safe, it does not mean that the team in the lead will lose the game. Rather, it tells us that the opposing team is likely to take over the lead at some point later in the game. This means we can't directly compare the prediction method shown above with the Safe Lead method.

To do a proper comparison using our Markov model, we need to calculate a slightly different estimate: the probability that the leading team changes at least once over the remaining \(n\) scoring events of the game. Mathematically, this is equivalent to the probability of the lead, \(l\), ending in a state \(j\), where \( 0 < j \le m\) if \( l < 0 \) or \( -m \le j < 0\) if \( l > 0 \), summed over the remaining steps in the game, \(n\). For example, if \(l < 0 \), then the probability of the leader changing at least once is equal to

\[\begin{aligned} \Pr(\textrm{leader changes} \mid l, n) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^m P^i_{lj}. \end{aligned} \]The figure below plots the AUC of each method as a function of time in the last quarter of the game. I've left off the first 3/4 of game time because there the Safe Lead method does no better than chance.
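This estimate can be sketched with the same machinery as the win-probability calculation, here on a toy three-state chain with made-up transition probabilities:

```python
import numpy as np

def leader_change_prob(P, lead_state, n, opponent_states):
    """The post's estimate of Pr(leader changes at least once | l, n):
    the probability mass that has crossed into states where the other
    team leads, averaged over the next n steps."""
    total = 0.0
    Pi = np.eye(P.shape[0])
    for _ in range(n):
        Pi = Pi @ P
        total += Pi[lead_state, opponent_states].sum()
    return total / n

# Toy chain over leads (-1, 0, +1); the away team currently leads
# (state 0), and the single home-lead state is index 2.
P = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])
p_change = leader_change_prob(P, lead_state=0, n=2, opponent_states=[2])
```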

From the results above, it's clear that the Markov chain classifier drastically outperforms Bill's method. However, this isn't to say the Safe Lead method is bad or incorrect. The main reason the Safe Lead method operates at a lower accuracy than the Markov chain is that it's more conservative. That is to say, when the Safe Lead method predicts a lead is safe, it's almost always right, but it labels many leads unsafe that later prove to be safe. In the language of confusion matrices, its "safe" predictions are highly precise, but it suffers a high false negative rate.
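To make the conservative-classifier point concrete, here's a toy confusion-matrix calculation with entirely made-up counts (treating "safe" as the positive class):

```python
def rates(tp, fp, fn, tn):
    """Precision and false-negative rate from confusion-matrix counts."""
    precision = tp / (tp + fp)
    fnr = fn / (fn + tp)
    return precision, fnr

# Hypothetical counts for a conservative classifier: it rarely calls a
# lead safe, but is almost always right when it does.
precision, fnr = rates(tp=40, fp=2, fn=60, tn=98)
# precision is high (safe calls are trustworthy), while the false
# negative rate is high (many safe leads get labeled unsafe)
```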

In this post, I've presented a probabilistic model of scoring dynamics in basketball and used it to construct a Markov chain that can make predictions about different types of scoring events in a game. I've also analyzed Bill James' Safe Lead method, compared it to the Markov chain method, and showed that the Markov chain achieves higher accuracy.

For fans who like to watch streaming statistics of scoring events, I've created a web application, onthespotsports.com, that provides real-time predictions for a variety of scoring events, including the ones reviewed here. The site is best viewed on up-to-date versions of Chrome, Firefox, and Safari.