A language model (LM) focuses on estimating the conditional probability
P(w_{m}|w_{1}^{m-1})
for an m-gram LM. This can be estimated by maximum likelihood as
P(w_{m}|w_{1}^{m-1}) = \frac{c(w_{1}^{m})}{c(w_{1}^{m-1})},
where c(w_{1}^{m}) denotes the count of the m-gram (w1, w2, …, wm). This is a good estimate when the counts are large enough. However, it becomes a problem if the m-gram is never seen in the training data (corpus). In that case, the maximum likelihood estimate sets the probability to 0, which is often too strict to be true, as language is quite versatile.

To overcome this issue, a straightforward idea is: if an m-gram does not exist, why not back off and use the (m-1)-gram instead, recursively, until ending at the unigram? This is the basic idea of Katz backoff, but there is one issue here. Take a tri-gram LM for example. Say the particular tri-gram (“I”, “am”, “I”) never occurs in our training data, which indicates that the probability for “I” to follow “I am” is very small and agrees with our intuition. However, if we back off to the bi-gram, the probability of “I” following “am” becomes much larger than it should be. In general, since the backoff eventually ends at the unigram, it guarantees an estimate greater than 0 even when the true P(w_{m}|w_{1}^{m-1}) is essentially 0.

On the one hand, a naïve backoff makes the total probability no longer add up to 1. On the other hand, the fact that an m-gram does not occur in the data means it is unlikely, and you should respect this information instead of tossing it away.

Good-Turing provides a method to estimate the probability of something that is never seen in the training data: it is estimated by the probability of something that appears only once in the training data. This makes sense, as the probability of something that appears only once provides an upper bound for something never seen. Good-Turing then adjusts the probabilities of things that appear once, twice, three times, … so that they add up to 1.

Imagine a thought experiment. There is a box of balls of different colors. To learn the distribution of colors in the box, we randomly pick a ball from the box, record its color, and put it back, repeating N times. The colors we observe are denoted l1, l2, …, lk, and their observed frequencies are c(l1), c(l2), …, c(lk), with c(l1) + c(l2) + … + c(lk) = N. However, there is a chance that we never pick a ball with a rare color. As introduced above, the Good-Turing method estimates the probability of a ball with an unseen color by the fraction of draws whose color is observed only once. Adjusting the probabilities of seen colors is more involved. First, build a frequency of frequencies from c(l1), c(l2), …, c(lk): denote it n(r), the number of colors li with c(li) equal to r. The smoothed probability for color li is

P_s(l_i) = \frac{c^*(l_i)}{N},
where the adjusted count is
c^*(l_i) = (c(l_i) + 1)\frac{n(c(l_i)+1)}{n(c(l_i))}

Let’s walk through an example. Suppose the sequence of colors we observe is

We can then count the number of times each color is observed as

Then get the count of count as
[Table: count of counts]

Finally, we can calculate the smoothed counts as
[Table: smoothing calculation]

Notice that the smoothed count for the black color becomes 0. This is undesirable, so in the Katz method, smoothing is only done for items observed 1 to k times. For a frequently observed item (count > k), the probability is set equal to the maximum likelihood estimate.
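The adjustment above can be sketched in a few lines of Python. This is a minimal sketch: the function name and the cutoff parameter k are my own, and items whose adjusted count is undefined (n(c+1) = 0) fall back to the maximum likelihood count, as in the Katz fix.

```python
from collections import Counter

def good_turing_counts(observations, k=5):
    """Good-Turing adjusted counts with a Katz-style cutoff:
    items seen more than k times keep their raw (ML) count."""
    counts = Counter(observations)            # c(l_i)
    freq_of_freq = Counter(counts.values())   # n(r): how many items were seen r times
    n_total = sum(counts.values())            # N

    adjusted = {}
    for item, c in counts.items():
        if c <= k and freq_of_freq.get(c + 1, 0) > 0:
            # c*(l_i) = (c + 1) * n(c + 1) / n(c)
            adjusted[item] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[item] = float(c)         # keep the ML estimate
    # probability mass reserved for unseen items: n(1) / N
    p_unseen = freq_of_freq.get(1, 0) / n_total
    return adjusted, p_unseen

# toy draw from the ball-color experiment above
balls = ["red"] * 3 + ["blue"] * 2 + ["green"] + ["yellow"]
adj, p0 = good_turing_counts(balls)
```

Here n(1) = 2, n(2) = 1, n(3) = 1 and N = 7, so the unseen-color mass is 2/7 and, for example, the singleton colors get the adjusted count (1+1)·n(2)/n(1) = 1.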


Imagine the following highly simplified broker-client market: at each time step, a client asks the broker for a quote; the broker provides a two-way quote, a bid price and an ask price at which the broker is willing to buy and sell respectively; the client can then choose to accept or reject.

To model the clients, we assume a client is either a buyer or a seller (excluding hedge funds that could flip between buying and selling after seeing the quote). At each time step, only a unit amount of the security can be transacted. Each buyer or seller has an opinion on the true price of the security. We can use two random variables, \xi_b for buyers and \xi_s for sellers, to collectively describe the distributions of the true prices they believe in. The buyers should on average expect a lower price than the sellers (E(\xi_b) < E(\xi_s)); otherwise the market is not in equilibrium, since the buyers would immediately buy as much as they can while the market mid is below the true price they believe, and the market would quickly reach equilibrium again. For now, assume the bid-ask spread is zero, i.e., the broker is willing to both buy and sell at the same price X. In one time step, the probability that the broker gets a buy order is P(buyer) \cdot P(X \leq \xi_b), and a sell order is P(seller) \cdot P(X \geq \xi_s), where P(buyer) and P(seller) are the probabilities of a random client being a buyer or a seller respectively. An example is shown below in Fig. 1.


The price at which the blue and red cdfs intersect is the true mid price, where
P(buyer) \cdot P(X_{mid} \leq \xi_b) = P(seller) \cdot P(X_{mid} \geq \xi_s).

This true mid X_{mid} is the most efficient price, facilitating the maximum number of transactions. Imagine a different market in which, at each time step, there is a grand auction where buyers and sellers write down orders and the prices they are willing to transact at. Matching the orders would yield this mid X_{mid}.
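The balance condition above can be solved numerically. As a sketch, I assume logistic distributions for \xi_b and \xi_s with made-up parameters (means 98 and 102, scale 2, equal buyer/seller populations); by symmetry the root should land at 100:

```python
import math

def logistic_cdf(x, loc, scale):
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))

# hypothetical parameters: buyers think the price is ~98, sellers ~102
p_buyer, p_seller = 0.5, 0.5
buy_mean, sell_mean, scale = 98.0, 102.0, 2.0

def imbalance(x):
    # P(buyer) * P(x <= xi_b) - P(seller) * P(x >= xi_s)
    p_buy = p_buyer * (1.0 - logistic_cdf(x, buy_mean, scale))
    p_sell = p_seller * logistic_cdf(x, sell_mean, scale)
    return p_buy - p_sell

# bisection for the root: the price where buy and sell flows balance
lo, hi = 90.0, 110.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if imbalance(lo) * imbalance(mid) <= 0:
        hi = mid
    else:
        lo = mid
x_mid = 0.5 * (lo + hi)
```

With these symmetric assumptions the imbalance is monotonically decreasing, so the bisection converges to the unique balanced price.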

It is also advantageous for the broker to quote this true mid, as on average the buy and sell flows will then match. Otherwise, the broker’s inventory will grow linearly with time. In this idealized model, there are two challenges in being the broker:

a) How does the broker know the true mid?

The broker is not running a grand auction that would give him a full picture of the distributions of \xi_b and \xi_s. What he can do is make quotes and see whether they get accepted or rejected. In theory, he could make a series of random quotes based on some prior distribution and then infer the distributions from the results of the quotes. The dashed lines in Fig. 1 are the distributions inferred by running an expectation-maximization algorithm on a logistic mixture model. They do converge to the true distributions. However, convergence is only reached once the quotes sample a large enough range in huge numbers. To use the data most efficiently, the broker has to update the quote online.

b) How to keep inventory size down?

Even if the broker knows the true mid, so that on average the buy and sell flows cancel, the random nature of the orders makes the fluctuation (standard deviation) of the inventory size grow proportionally to the square root of time. To keep the inventory size under a limit, the broker has to skew the mid. For example, skewing the mid below the true mid makes it more likely for a buyer to accept, which helps get rid of a positive inventory. However, since on average the security is transacted at the true mid, the broker will lose some money due to the skew. He will have to charge some bid-ask spread to compensate for the skew.

Optimal strategy for the broker

We can come up with an optimal strategy for the broker by breaking the problem into two parts according to the two questions above. The first is an online measurement problem. The second is a stochastic optimal control problem. As a first attempt, we do not try to optimize the bid-ask spread but assume the broker charges a small fixed bid-ask spread. Optimizing the bid-ask spread would have to take into account competition with other brokers for market share, which is not captured in this model anyway.

a) Online measurement

The true mid X_{mid,t} is not directly observable. We make the usual assumption that it follows a martingale process as
X_{mid,t} = X_{mid,t-1} + \sigma_{mkt}\epsilon_{mid,t}, (Eq. 1)
where \epsilon_{mid,t} follows a standard normal distribution N(0,1) and \sigma_{mkt} is the volatility from market moves. Given an estimate of mid price at time t-1, the expectation of mid price does not change
E(X_{mid,t}|X_{mid,t-1}) = E(X_{mid,t-1}),
but the expectation of variance increases with time
E(var(X_{mid,t})|var(X_{mid,t-1})) = E(var(X_{mid,t-1})) + \sigma_{mkt}^2.

At time step t, after making a quote with mid price Q_t, the broker receives a response from the client. Denote the response as a random variable R_t \in \{-1,0,1\}, where -1, 0, 1 mean the client buys, passes, and sells respectively. Denote the skew, the difference between the quote and the true mid, as s_t = Q_t - X_{mid,t}. The robust relationship the broker relies on to infer the true mid is the positive correlation between s_t and R_t, which we can approximate by a linear regression relationship

s_t = \beta R_t + \epsilon_{obs,t}, (Eq. 2)

where \epsilon_{obs,t} is a random variable independent of s_t, with mean 0 and variance \sigma_{obs}^2. Conditioned on the response R_t at time t, we have
E(X_{mid,t}|R_t) = Q_t - \beta R_t
var(X_{mid,t}|R_t) = \sigma_{obs}^2
These can be used to update our estimate of the mid at time t. In practice, the parameters \beta and \sigma_{obs} can be estimated from a number of recent past quotes. We get the equations to update our estimates from t-1 to t:

\mu_t = \mu_{t-1} + K(Q_t - \beta R_t - \mu_{t-1}),
var_t = (1-K)(var_{t-1}+\sigma_{mkt}^2 ),

where \mu_t, var_t denote the estimates for the mean and variance of X_{mid,t}, and
K = \frac{var_{t-1}+\sigma_{mkt}^2}{var_{t-1}+\sigma_{mkt}^2+\sigma_{obs}^2}
is the Kalman gain. Essentially, Eq. (1) is the state-transition model and Eq. (2) is the observation model, and we use a Kalman filter to update our estimate of the mid price online.
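The update can be sketched as a one-step Kalman filter in Python. This is a minimal sketch: the function and variable names are my own, and the numbers in the usage example are made up.

```python
def kalman_update(mu_prev, var_prev, Q_t, R_t, beta, sigma_mkt2, sigma_obs2):
    """One Kalman step for the hidden mid price.
    Predict with the state-transition model: prior variance grows by sigma_mkt2.
    Correct with the observation model: the implied mid observation is
    z_t = Q_t - beta * R_t with noise variance sigma_obs2."""
    var_prior = var_prev + sigma_mkt2             # predict
    K = var_prior / (var_prior + sigma_obs2)      # Kalman gain
    z_t = Q_t - beta * R_t                        # implied mid observation
    mu_new = mu_prev + K * (z_t - mu_prev)        # correct with the innovation
    var_new = (1.0 - K) * var_prior
    return mu_new, var_new

# example step: prior N(0, 1), quote 2, client sells (R_t = 1), beta = 1
mu, var = kalman_update(0.0, 1.0, 2.0, 1, beta=1.0, sigma_mkt2=0.0, sigma_obs2=1.0)
# K = 0.5 and z_t = 1, so mu -> 0.5 and var -> 0.5
```

Running this step repeatedly, with one call per client response, is exactly the online update described above.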

b) Stochastic optimal control

To make the problem analytically solvable, we approximate our discrete model with a continuous one:
dI_t = ks_tdt + \sigma_I dW_t,
where I_t is the size of the inventory, k is a coefficient related to 1/\beta in Eq. (2), and W_t is a Brownian motion.
Suppose the cost function to optimize is
V(I_t, t) = \underset{s}{\min}\left\{ \intop_{t}^{T}(s_{t}^{2} + (rI_{t})^{2})dt\right\}
As the rate of accumulating inventory is linearly proportional to the skew, and the money paid to offload the accumulated inventory is also proportional to the skew, s_t^2 is probably a decent estimate of the cost of the skew being away from zero. The other cost term on inventory, (rI_t)^2, is more hand-wavy; it is just meant to be monotonically increasing with the magnitude of the inventory.

Using Hamilton–Jacobi–Bellman equation, we get the optimal control
s_t = - r I_t,
and the associated cost function
V(I_t, t) = (I_t^2 - \sigma_I^2 t)r/k.
The result means we simply skew the quote proportionally to the size of the inventory. Under this optimal control, the cost accumulates at a rate of \sigma_I^2 r/k.
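A quick sanity check of this control in Python (a crude Euler discretization with made-up parameters, not the full simulation below): with the skew s_t = -rI_t the inventory becomes mean-reverting, while with no skew it is a pure random walk whose fluctuation grows like the square root of time.

```python
import math
import random

random.seed(0)

def simulate_inventory(r_skew, k=1.0, sigma=1.0, dt=1.0, n_steps=5000):
    """Euler scheme for dI = k*s dt + sigma dW under the control s = -r_skew * I."""
    inv, path = 0.0, []
    for _ in range(n_steps):
        s = -r_skew * inv                      # optimal skew from the HJB solution
        inv += k * s * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
        path.append(inv)
    return path

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

controlled = simulate_inventory(r_skew=0.1)    # mean-reverting inventory
uncontrolled = simulate_inventory(r_skew=0.0)  # pure random walk
# the controlled inventory fluctuates far less than the random walk
```

In the controlled case the discrete dynamics are an AR(1) process with coefficient 1 - k·r·dt, so the inventory stays within a bounded band instead of wandering off.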

Simulation results

We can simulate the client behavior and the broker’s strategy as described above. In this simulation, buyers and sellers have equal populations. The standard deviations of \xi_s, \xi_b are fixed at 2, and on average a buyer expects a price lower than a seller’s by 4: E(\xi_s-\xi_b) = 4. The true mid follows a Brownian motion. The broker quote, the true mid, and the broker’s estimated mid are shown in Fig. 2. After an initial period (t < 1e4) during which the quote oscillates widely, the quote tracks the true mid price closely.

[Fig. 2: broker quote, true mid, and estimated mid]

The inventory and the PnL the broker makes are shown below. The inventory usually stays below 10. Around t ~ 8e4, when there is a stochastic downward trend, the inventory shoots to ~40, which is also reflected as a drawdown in the PnL. The bid-ask spread is fixed at 0.75, at which the broker consistently makes a profit.

A zoomed-in look at the quote (Fig. 4 and 5) shows that the quote (estimated mid plus skew) actually tracks the true mid much better than the estimated mid does. This is probably because in Eq. 2, s_t is the difference between the quote and the true mid, whereas in our simulation the true mid is replaced by the estimated mid. This introduces an inconsistency, and the skew, being proportional to the inventory size, somehow corrects it.


Fig. 4: Zoom-in of Fig. 2.


Fig. 5: Zoom-in of Fig. 2.

CPU/GPU benchmark on MNIST

I just wanted to give the GPU a try after setting up TensorFlow. I took the convolutional neural network example from here, and ran the benchmark on both CPU and GPU. The training times are compared here:

i7-7700K on all 4 cores: 1557 seconds = 26 min

GTX 1080: 80 seconds = 1.3 min

That is a 19x speedup!

1. CPU: i7-7700K $305

The first choice is between a Xeon and an i7. If you want to go multi-CPU, the natural choice is Xeon (there is also the Ryzen option, but I don’t dare dive into it). But then the whole budget increases significantly, as you get into the space of business computing. Building a higher-end machine in the space marketed for gaming is more cost friendly.

The other choice that took me some time was between the 7700K and the 7800X. The latter has 6 cores whereas the former has 4; I could use more cores to tune parameters in parallel. However, I chose the 7700K based on one key consideration: the 7700K has integrated graphics whereas the 7800X does not. I plan to use the graphics card to train neural networks, and the display will freeze if the video output comes from the graphics card while it is doing heavy computing, which is highly undesirable. Instead, I want to use the CPU’s integrated graphics for video output. This saves me the money for a secondary graphics card solely for display, as well as PCIe lanes. This is another reason I chose the i7 over a Xeon, as the latter does not have integrated graphics.

Update 2017-10-12: Now that the i7-8700K is out, you should probably consider that one instead.

2. Motherboard: ASUS PRIME Z270-P $106

The 7700K pairs with the Z270 chipset, so it took me little time to settle on the ASUS Z270 series. The only choice I made was between the Z270-P and the Z270-A; the latter is about $50 more expensive. The main difference seems to be that the Z270-A is designed for overclocking, which I don’t need.

3. CPU fan: Cooler Master Hyper 212 EVO $30

It is definitely powerful and makes very little noise. But I might go with a more compact one next time for easier installation. Also, my very cheap case could not close when I finished the build because this cooling tower is too tall LOL!

4. RAM: G.SKILL 16GB 288-Pin DDR4 3200 $130

RAM price has been increasing like crazy in 2017. I wish I had bought two of them.

5. SSD: Samsung 850 EVO 250GB $90

I had a 3TB external drive, so I only needed a minimal SSD.

6. Power supply: Corsair CX Series 650 Watt 80 Plus

A 650W power supply should be enough for this build.

7. Ethernet cable: Amazon basics $5

Few motherboards come with WiFi, so I need an ethernet cable.

8. Case: Phanteks Eclipse P400 ATX Tempered Glass Edition $70

I got a cheap ($40) case the first time and had a million problems with it: the side lid could not close because my CPU cooler is too tall and the case is too narrow; there was no good place to mount the SSD, so I had to improvise… Eventually, I gave up and bought a new one. I am really happy with this Phanteks case. The lesson: don’t assume the case is unimportant and try to save $20 on it.

9. GPU: MSI GTX 1080 AERO 8G OC $460

I waited a few weeks and found a deal. An article explaining why there are so many different makes of Nvidia cards is here.

A list of websites I found useful:

PCPartPicker: you can pick a list of components there; it helps you compare prices and roughly check compatibility.

Tom’s Hardware: lots and lots of information on specific parts.

No-video-output problem checklist after finishing the build: very useful, especially for those building a machine for the first time. My screen didn’t show anything at first because the RAM was not plugged in firmly enough.

Last time I talked about how to understand overfitting via Haussler’s (1988) theorem. In the introduction, I used the figure from Wikipedia that is often used to explain overfitting. Why does this figure explain overfitting? There is more behind it.


Fig. 1: Wiki’s illustration of over-fitting

The idea behind this figure is the frequentist view of model complexity: the more complex your model, the bigger the variance in your prediction. A big variance in prediction means your model is very sensitive to the data. Let’s consider a regression problem similar to the one this figure shows. Suppose there is an underlying process that generates x and y, and you can keep sampling from it. For example, the true model is y=0.1x^{2}+2x+1+\epsilon, where \epsilon is random noise. After obtaining 50 samples (x,y), you move on to build a model. The samples look like Fig. 2.


Fig. 2: samples

A simple model that is not likely to overfit is y=ax+b; the result is shown in Fig. 3. A complex model that is likely to overfit is y=\sum_{i=0}^{15}c_{i}x^{i}; the result is shown in Fig. 4. Similar to the Wiki example, there is spurious oscillation, though not as wide as in the Wiki example, since we are not matching every sample point exactly.


Fig. 3: Simple linear regression model


Fig. 4: More complex (but still linear) regression model.

As we can keep sampling the process, we can draw multiple batches of samples, each containing 50 data points. Repeating the fit again and again, the parameters (a, b, c_{i}) in our models will generally differ each time. How robust are our predictions? Figures 5 and 6 below show predictions from the simple model and from the complex model. We can see that there is more variation in the complex model.


Fig. 5: Simple linear regression results from multiple batches of samples.


Fig. 6: More complex regression results from multiple batches of samples.
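The batch experiment above can be sketched as follows (using numpy’s polyfit; the noise level, sample domain, and prediction grid are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 10, 50)   # fixed grid on which to compare predictions

def batch_predictions(degree, n_batches=200, n_samples=50, noise=2.0):
    """Fit a degree-`degree` polynomial to fresh samples from the true model
    y = 0.1 x^2 + 2 x + 1 + noise, and predict on a fixed grid."""
    preds = []
    for _ in range(n_batches):
        x = rng.uniform(0, 10, n_samples)
        y = 0.1 * x**2 + 2 * x + 1 + rng.normal(0, noise, n_samples)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_grid))
    return np.array(preds)

simple = batch_predictions(degree=1)
complex_ = batch_predictions(degree=15)

# average (over the grid) standard deviation of predictions across batches
std_simple = simple.std(axis=0).mean()
std_complex = complex_.std(axis=0).mean()
```

The spread of predictions across batches, std_complex versus std_simple, is exactly the quantity plotted in Fig. 7: the degree-15 model varies far more from batch to batch than the linear one.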

To quantify this, we plot the standard deviation of the predictions of the two models, calculated from 500 batches. It is now clear from Fig. 7 that the complex model is less robust and has higher variance in prediction. The takeaway is that you can use the variance in prediction to quantify how complex a model is, given a fixed sample size. The bias-variance trade-off refers to the situation that a less complex model often has more bias but lower variance. The averages of the predictions from the 500 batches for the simple and complex models are shown in Fig. 8. As expected, the ensemble-mean prediction of the more complex model is less biased than that of the simpler model; it is actually very close to the true underlying model. This is not surprising, as given infinite data, our estimate for the more complex model should converge to the true model. However, the simple model will always be biased because it lacks the quadratic term.


Fig. 7: Standard deviation in prediction for simple and complex models.


Fig. 8: Mean prediction of the 500 batches from the simple and complex models, as well as the true model.

The following figure is often used to explain overfitting (by Ghiles, from Wikipedia).


In this figure, the points are training data. Two hypotheses, a linear model and a polynomial model, are fitted to the data. Although the more complex polynomial model perfectly matches the training data at every point, it is pretty obvious that the wild oscillations are spurious. It is much more likely that the simpler linear model describes the underlying “true” model better. The more complex polynomial model overfits the training data by picking up the random errors in the data. When a model is too complex, it tends to overfit the training data. Why is that?

The Wikipedia page explains overfitting as a violation of Occam’s razor, which states that “among competing hypotheses, the one with the fewest assumptions should be selected”. This is rather a heuristic principle, but perhaps has a profound philosophical reason. The pursuit of aesthetic hypotheses probably guides many discoveries in theoretical physics. In machine learning, however, the true model may not be aesthetic and may be awfully complicated. Here is another way to understand why a simpler model is still preferred given similar performance on the training data, from Haussler’s (1988) theorem, which states:

Consider the case where the trained model has no error on the training data of size m. The probability that this model has an error rate larger than \epsilon on unobserved data is at most |H|e^{-\epsilon m}, where |H| is the number of hypotheses in the hypothesis space. In formula:

P\left[\left(\exists h\in H\right)\ s.t.\ \left(error_{train}\left(h\right)=0\right)\wedge\left(error_{true}\left(h\right)>\epsilon\right)\right]\leq|H|e^{-\epsilon m}

To understand the meaning of a hypothesis space and its size |H|, let’s use an example. Consider a binary classification problem. The input is \mathbf{X}=(x_{1},x_{2}), where x_{1}, x_{2} are integers from 1 to 10, and the output Y to predict is 0 or 1. A hypothesis is a mapping from \mathbf{X} to Y, and a hypothesis space is a set of such possible mappings. For example, a decision tree hypothesis space H_{tree} is shown in the figure below. The parameters in the model are C_{1}, C_{2}, C_{3}, which take values from 1 to 10, and D_{1}, D_{2}, D_{3}, D_{4}, which take values 0 or 1. The number of hypotheses in H_{tree} is the number of possible combinations of the parameters, therefore |H_{tree}|=10^{3}\times2^{4}.


Consider the most exhaustive hypothesis space H_{exh}: all the possible mappings from \mathbf{X} to Y. In this case, |H_{exh}|=2^{100}, which is vastly larger than |H_{tree}|.

Suppose a learning algorithm selects a hypothesis from H_{tree} that achieves zero error on the training samples, and let’s put some numbers in: say the sample size is 1000. Then Haussler’s (1988) theorem says the probability of the true error being larger than 5\% is less than 3e-18. However, if the learning algorithm selects a hypothesis from H_{exh}, the theorem gives a bound of 2.4e8, which is useless. The reason is that it is too easy to come up with a hypothesis from H_{exh} that achieves zero training error: just memorize all the training samples. Therefore, from Haussler’s (1988) theorem, we see why a more complex hypothesis space (a more complex model) tends to have a higher generalization error: overfitting!
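The two bound values above can be verified in a couple of lines (the function name is my own):

```python
import math

def haussler_bound(h_size, eps, m):
    """Upper bound |H| * exp(-eps * m) on the probability that some hypothesis
    with zero training error has true error above eps (Haussler, 1988)."""
    return h_size * math.exp(-eps * m)

bound_tree = haussler_bound(10**3 * 2**4, eps=0.05, m=1000)  # ~3e-18, a real guarantee
bound_exh = haussler_bound(2**100, eps=0.05, m=1000)         # ~2.4e8, a vacuous bound
```

The same exp(-eps*m) factor multiplies both, so the entire difference comes from |H|: the tree space has 16,000 hypotheses while the exhaustive space has 2^100.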

The GARCH(1,1) model here is:
r_t = \mu + \epsilon_t
\epsilon_t = \sigma_t e_t
\sigma^2_t = \omega + \alpha \epsilon^2_{t-1} + \beta \sigma^2_{t-1}

This says that the conditional variance of the return follows an AR(1)-like process. I think of this model as a prism to understand volatility. Volatility itself is not observable. You may calculate a “historical volatility” from the standard deviation of returns over a period in the past, but how long a period to choose is rather arbitrary: should it be a month or 3 months? If the period is too long, your calculated “historical volatility” may not show a spike even though the market has already crashed. For the GARCH model, the hyperparameters can hopefully be obtained by some sort of cross-validation. In addition, I would expect a large deviation of r_t from its mean \mu to translate more quickly into an increase in volatility in the GARCH model. So the volatility inferred from the GARCH model could be more instantaneous and thus closer to my intuition.

With a few lines of code, I can calibrate a GARCH model to SPX return and get the conditional volatility from it.

import datetime as dt
import numpy as np
import pandas_datareader.data as web
from arch import arch_model

start = dt.datetime(2000, 1, 1)
sp500 = web.DataReader('^GSPC', 'yahoo', start=start)
returns = 100 * sp500['Adj Close'].pct_change().dropna()
# constant-mean GARCH(1,1)
am = arch_model(y=returns, mean='Constant', vol='GARCH', p=1, q=1, hold_back=10)
res = am.fit()
cond_vol = np.sqrt(250.) * res.conditional_volatility

Notice that I annualize the daily volatility by multiplying by sqrt(250). The conditional volatility from the GARCH model, along with the 20-day (trading days) historical vol and the VIX, are shown below. All three look very similar at first glance:


If we zoom into some volatility spikes, for example, between Aug and Sep 2015,


The VIX spikes before both the conditional volatility from GARCH and the historical volatility. The conditional volatility, however, seems more responsive than the historical volatility. This can be seen more clearly from the lagged correlations between these volatilities:


On average, the VIX moves about 2-3 days ahead of the conditional volatility, and ~5 days ahead of the historical volatility.
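The lagged-correlation calculation can be sketched as follows. This is a synthetic check with white noise rather than the actual volatility series, and the function name is mine:

```python
import numpy as np

def best_lead(x, y, max_lag=10):
    """Return the lag (in steps) at which series x best leads series y:
    a result of d means x at time t lines up best with y at time t + d."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    best, best_corr = 0, -np.inf
    for lag in range(0, max_lag + 1):
        xs = x if lag == 0 else x[:-lag]
        corr = np.corrcoef(xs, y[lag:])[0, 1]   # correlate x with y shifted by lag
        if corr > best_corr:
            best, best_corr = lag, corr
    return best

# synthetic check: y is x delayed by 3 steps plus a little noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.roll(x, 3) + rng.normal(scale=0.1, size=500)
```

Applied to the VIX and the GARCH conditional volatility series, the same scan over lags is what produces the 2-3 day lead quoted above.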