Mathematics Puzzle: Adding Random Numbers

Posted on February 28, 2008
Filed Under Math Puzzles | 6 Comments

I have no intention of turning this blog into a math blog, but every now and then I run into interesting mathematical puzzles. My two favorite fields are combinatorics and probability, and this puzzle belongs to one of them:

Suppose you start adding up random numbers chosen from the interval [0,1]. How many numbers, on the average, would you have to add before their sum becomes equal to or exceeds one? 

(Hint: the answer is NOT 2 … )

Physics Puzzle: Newton’s Cradle

Posted on February 1, 2008
Filed Under Physics Puzzles | 6 Comments

We’ve all seen this executive toy somewhere:

Newton’s Cradle

As the ball comes down and a collision occurs, it comes to rest while another ball at the other end rises up. This repeats ad infinitum - at least in theory - and illustrates perfectly the principle of conservation of momentum - the momentum of the colliding ball on one end is “transferred” to the ball at the other end, because it can’t just “go away”.

Let’s imagine a different scenario. Let all of our balls be of identical size and shape, each having mass m. Now suppose that the ball coming down (in the photo above) is of mass 2m instead of m. Is seems that conservation of momentum would dictate that this time, TWO balls should rise up on the other end.

Is this correct, or is there some hidden fallacy in the argument?

Wavelets: Analyzing Scale

Posted on February 1, 2008
Filed Under Signal Processing | Leave a Comment

Physics is littered with transforms: Fourier, Radon, Wavelet, Laplace, Hilbert and many more. The idea behind each transform is - almost exclusively - to represent data in a way which makes it easy to “see” hidden structures. For instance, the Fourier transform allows you to see if there is any periodicity within your signal quite easily.

In this post I’ll tell you about another way of looking at hidden structure: Wavelets. Well, actually, it’s a pretty big field, so it’s impossible to go into the details, so I’ll just be giving you the general idea of things. Simply put, the Wavelet transform is a way of spotting different scales in your data.

For example, suppose you’re analyzing an earthquake. It is well known that massive earthquakes are preceded by weaker ones (which are preceded by yet weaker ones, ad infinitum - well, in theory at least). So you’ve got a phenomenon which replicates itself on several scales. This makes it quite amendable to Wavelet analysis (that is, analyzing the seismographic signals using Wavelet transforms).

Another example might be a face recognition software. Faces have all sorts of features to them of varying scale - your forehead and cheeks are on a different scale compared to, say, your teeth, which are on a different scale compared, say, to your mustache’s hairs. A good face recognition software would probably ignore details on a very fine scale. Using the Wavelet transform you could easily “discard” those details - in essence, smooth your image quite nicely - before running any tests. Wavelets are also a great way of compressing images. In fact, the FBI chose to store its fingerprint database using Wavelet compression techniques, which have proven very fast and effective, with a high compression ratio.

The Haar Transform

The name “Wavelet Transform” is a bit misleading because there are quite a few of them. We’ll be examining a particular one, called the Haar Transform, which is probably the simplest Wavelet transform. Suppose you have some signal, which we will choose to represent using a vector:

 \mathbf{s} = (s_1, s_2, s_3, ...., s_N)

We can decompose this input signal into two sub-signals by taking the sums and differences of our original vector:

 \mathbf{s}_+ = \frac{1}{\sqrt{2}} (s_2 + s_1, s_4 + s_3, s_6 + s_5, ..., s_{N} + s_{N+1}) \\ \mathbf{s}_- = \frac{1}{\sqrt{2}} (s_2 - s_1, s_4 - s_3, s_6 - s_5, ..., s_{N} - s_{N-1})

The meaning of these two signals (which are each half the length of the original signal) is as follows:

Thus, we’ve extracted two parts from our signal - a slow varying, smooth part, (\mathbf{s}_+), and a part which contains all of the fine details (\mathbf{s}_-). It is quite trivial to go back from this point - that is, reconstruct our original signal, given \mathbf{s}_+ and \mathbf{s}_-. Just add and subtract \mathbf{s}_+ and \mathbf{s}_- and do a bit of algebra, and you’re good to go.

But what’s science without a bit of fancy jargon to back it up? \mathbf{s}_+ is often called the trend, and \mathbf{s}_- is often called the fluctuation of our signal.

Haar Transform, Level 2

What we’ve just done in the previous part is called a level-1 Haar transform. We can now take our trend and once again “chop it up” into its own trend and fluctuation. This would be a level-2 Haar transform. This, in theory, can go on as long as we have signal left - recall that each trend is half the length of the previous one:

Haar Transform

That’s all there is to it, really - you’ve just performed your first Wavelet transform. The fluctuations allow us to observe the fine details of our signal at increasing scale. This is the essence of the Wavelet transform.

This scheme can also be used to compress the signal - by truncating the Wavelet transform at a particular level and keeping merely the trend, we’re saving a compressed, smoothed version of that signal (the compression is lossy). Such compression methods are quite effective, as noted previously.

Wavelets in Reality

Where are wavelets used in reality? Here are just a few examples.

Well, I’m no wavelets-expert, so I’ll be wrapping up my discussion at this point. I hope I’ve given you an idea of the power and usefulness of wavelets. See you next post!

Google’s PageRank (Sort of) Explained

Posted on December 21, 2007
Filed Under Algorithms, Computer Science, Mathematics | Leave a Comment

In this post we’ll take a look at the algorithm which defines Google’s pagerank (PR) analysis. A webpage’s PR is a number between 0 and 10 that Google uses to estimate the usefulness of that page. For example, CNN.com has a PR of 9. A “typical” web site might have a PR of 5. Pages with spam or malicious content get a PR of 0, also known as the page rank of death, feared by all spammers worldwide.

So, how does Google go about determining the rank of a page? The idea is as follows. Suppose you were looking for a doctor to treat some disease you have. You would probably go about asking for recommendations from different people you know. The doctor who got the most recommendations would be the one, right? Well, almost. You would naturally weight each reference by the importance of the person who gave it. The recommendation of your brother, who is a doctor himself, will probably be given a more significant weight than, say, the recommendation given to you by your friend who is really just a farmer. The same idea governs the PR algorithm used by Google.

Let’s take a simple example which will make things clearer. Take a look at the “mini-Internet” I’ve drawn in this figure:

Miniature Internet Model: The Internet as a Graph

My mini-Internet has 3 sites, labeled A, B, and C. I’ve drawn them as a graph. The sites link to each other using regular hyperlinks, which I drew using arrows. An arrow between site A and B means that A has a link leading to site B.

The importance of a site, according to the Google PR algorithm, is obtained by considering each site that links to it. Each referring site’s contribution to this “importance score” is proportional to the number of total outgoing links of the referring site, times to the importance of the linking site. Thus, the more a certain site “recommends” other sites (i.e., links to other sites), the less important is each recommendation (each link). This is much like in real life: the words of a person who speaks little have more weight attached to them than the words of a person who speaks a lot.

The situation described above may seem cyclical (a site’s importance is determined by the importance of the sites linking to it), but it becomes quite straightforward when put into equations. Let’s try it. We denote by P_i the page rank of site i, where i=A,B,C. We also denote by N_i the number of outgoing links from site i, so N_A = 2, N_B = 2, N_C = 1. The page rank of each site is given by:

 P_i = \sum_j \frac{P_j}{N_j}

where the sum ranges over all sites j linking to site i. Explicitly, we have for our mini-Internet:

P_A = \frac{P_B}{N_B} \\ P_B = \frac{P_A}{N_A} + \frac{P_C}{N_C} \\ P_C = \frac{P_A}{N_A} + \frac{P_B}{N_B}

(note, for example, that site C does not appear in the first equation because C doesn’t link to A.)
Plugging numbers into the N’s:

P_A = \frac{P_B}{2} \\ P_B = \frac{P_A}{2} + \frac{P_C}{1} \\ P_C = \frac{P_A}{2} + \frac{P_B}{2}

Those of you who are keen on linear algebra might write this using matrix notation as follows:

\left( P_A, P_B, P_C \right) = \left( P_A, P_B, P_C \right) \left( \begin{array}{ccc} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0<tex> \end{array} \right) ” class=”simple” /></tex></p>
<p><img src=

“AHA!” says you, “an eigenvalue equation!” - and, indeed, what we have obtained here is a standard eigenvalue equation of the sort you meet in linear algebra every day. Basically what Google does in order to compute a page’s PR is solve this eigenvalue equation.

Reality Check

In reality, though, things are somewhat more complicated. First of all, the web consists of billions of pages - about 8 last time I Googled the number (no pun intended) - and solving an eigenvalue with an 8,000,000,000 x 8,000,000,000 matrix is quite insane. Google uses an iterative solution, which computes the pageranks in several attempts, improving precision with each iteration, until a certain threshold is reached. Furthermore storing and manipulating such large matrices is an issue all in itself, and one needs to simply glance at a typical Google server installation to get a drift of the computing power needed to get things done (about 450,000, according to a recent New York Times estimate. Yep, you read that number right!).

Improvements on Basic Notion

The basic PR algorithm outlined above is a much simplified version of the one Google uses. The actual algorithm is a trade secret, but here are some potential problems which require a modification of the above scheme:

There are of course many other things to consider - I am no Google expert, mind you - but I hope I have given you a small taste of a very concrete application of linear algebra to our modern world.

Physics Puzzle: a Race Between Two Marbles

Posted on December 17, 2007
Filed Under Physics Puzzles | 11 Comments

For our weekly puzzle, here is a physics brainteaser to test your mechanics skills (did I mention already those are my favorite puzzles?)

A race between marbles

Consider a marble which rolls from start to finish along one of two paths: either an inclined, yet straight plane (B), or a series of valleys and hills passing below that plane (A). Assuming no friction and that the marble stays on its path at all times (and doesn’t, say, hit some bump and gets thrown up in the air), which path do you predict will take the least amount of time? Or, simply put, which marble, A or B, will reach the finish line first, and why?

Added 11/2/2008: A few clarifications.

Neural Networks Tutorial, Part #3

Posted on December 13, 2007
Filed Under Algorithms | Leave a Comment

In our previous tutorial we’ve laid out the basic form of our two layer feed-forward neural network (FFNN). In this installment we’ll derive a way of training it. Just to remind you, here is the basic outline of our neural network, along with all relevant variables:

A Two-Layer Feed Forward Neural Network

In general, when training a network, one prepares a set of inputs, x^\mu and a corresponding set of outputs, z^\mu. The superscript \mu denotes an input-output set. Note that each x^\mu is a vector, with components x_i^\mu:

x^\mu = \left( \begin{array}{c} x_1^\mu \\ x_2^\mu \\ . . . \\ x_N^\mu \end{array} \right)

In this example, N denotes the number of inputs.

In reality, however, when we feed the x^\mu's into our FFNN, the corresponding outputs \bar{s}^\mu don’t often match our expectations, z^\mu. We can quantity the total error from comparing all possible outputs to the desirable outputs by defining an error function as follows:

E = \frac{1}{2} \sum_{\mu, k} \left( z_k^\mu - \bar{s}_k^\mu \right)^2

The double sum here ranges over all input-output pairs, and over all components. We would like to minimize E - ideally make it zero, or at least as small as posible. We do this by employing a gradient descent method - at each iteration of our algorithm we compute the derivatives of E with respect to the bias vectors and weight matrices, and slightly change them in a manner proportional to the derivatives. In other words:

\begin{align}\delta w_{ij} = - \epsilon \frac{ \partial E}{\partial w_{ij}} \\ \delta \bar{w}_{ij} = - \epsilon \frac{ \partial E}{\partial \bar{w}_{ij}}  \\ \delta b_i = - \epsilon \frac{ \partial E}{\partial b_i} \\ \delta \bar{b}_i = - \epsilon \frac{ \partial E}{\partial \bar{b}_i} \end{align}

where \epsilon is some small number, the choice of which is somewhat of an art and some trial-and-error (more on that later). This method of adjusting our weights and biases is termed backpropagation, for a reason that will be made evident at the end of this post. Meanwhile, we need to get down and busy with computing the above derivatives. Be warned, things are going to get quite messy. For a firm understanding, I suggest following them on your own with pen and paper.

Computing the Derivatives: Output Layer

Our main tool in this computation will be the chain rule, which I’ll assume you’re familiar with. Let us start with the “easier” derivatives first:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \frac{\partial \bar{s}_k^\mu}{\partial \bar{w}_{ij}}

Since \bar{s}_k^\mu = \bar{f} (\bar{h}_k^\mu}), we employ our chain rule as follows:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \bar{f}'(\bar{h}_k^\mu) \frac{\partial \bar{h}_k^\mu}{\partial \bar{w}_{ij}}

Since \bar{h}_k^\mu = \sum_p \bar{w}_{kp} s_p + \bar{b}_k, we can immediately infer  \frac{\partial \bar{h}_k^\mu}{\partial \bar{w}_{ij}} = \delta_{ki} s_j, where \delta_{ij} is the Kronecker delta function, equal to 1 if i=j and 0 otherwise. Substituting and summing over k:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{k,\mu} \left( \bar{s}_k^\mu - z_k^\mu \right) \bar{f}'(\bar{h}_k^\mu)   \delta_{ki} s_j  = \sum_{\mu} \left( \bar{s}_i^\mu - z_i^\mu \right) \bar{f}'(\bar{h}_i^\mu) s_j

The quantities that appear in this equation are all directly computable by taking the input (for a particular \mu) and propagating it forwards in our neural networks. We’ll see how it’s actually done in an upcoming installment when we implement things in MATLAB, but for the time being, I’d just like to introduce a notation:

\bar{e}_k^\mu \equiv \left( \bar{s}_k^\mu - z_k^\mu \right)

This is the error in the output. Using this definition:

\frac{\partial E}{\partial \bar{w}_{ij}} = \sum_{\mu} \bar{e}_i^\mu \bar{f}'(\bar{h}_i^\mu) s_j

The computation of the bias is done similarly, and is even easier (you should try it yourself before reading further):

\frac{\partial E}{\partial \bar{b}_i} = \sum_{\mu_k} \bar{e}_k^\mu \bar{f}'(\bar{h}_k^\mu) \frac{\partial \bar{h}_k^\mu}{\partial b_i} = \sum_\mu \bar{e}_i^\mu \bar{f}'(\bar{h}_i^\mu)

where we have used:

\frac{\partial \bar{h}_k^\mu}{\partial \bar{b}_i} = \delta_{ki}

Computing the Derivatives: Input Layer

We now turn to the task of computing the derivatives relating to the first, input layer. The method is similar, only now we need to apply the chain rule twice. Here is how it’s done. The first step is a rehash of the computations we did above:

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu, k} \left(  z_k^\mu - \bar{s}_k^\mu \right) \frac{\partial \bar{s}_k^\mu}{\partial w_{ij}}

Proceeding as above,

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu, k} \left(  z_k^\mu - \bar{s}_k^\mu \right) \bar{f}' \left( \bar{h}_k^\mu \right) \frac{\partial \bar{h}_k^\mu}{\partial w_{ij}}

We now re-apply our chain rule as follows: first of all, note that

\bar{h}_k^\mu = \sum_p \bar{w}_{kp} s_p + \bar{b}_k \\ s_p^\mu = f(h_p^\mu) = f \left( \sum_q w_{pq} h_q^\mu + b_p \right)

and that

\frac{\partial h_p^\mu }{\partial w_{ij}} =  \frac{\partial }{\partial w_{ij}}  \left( \sum_q w_{pq} x_q^\mu + b_p \right) =  \delta_{pi} x_j^\mu

and hence:

\frac{\partial \bar{h}_k^\mu}{\partial w_{ij}} = \sum_p \bar{w}_{kp} \frac{\partial \bar{s}_p^\mu}{\partial w_{ij}} = \sum_p \bar{w}_{kp} f' \left( h_p^\mu \right) \frac{\partial h_p^\mu }{\partial w_{ij}} = \bar{w}_{ki} f' \left( h_i^\mu \right) x_j^\mu

Substituting this back into our main derivation, we obtain:

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu} \left[ \sum_k \bar{e}_k^\mu \bar{f}' \left( \bar{h}_k^\mu \right)  \bar{w}_{ki} \right] f' \left( h_i^\mu \right) x_j^\mu

Formally speaking, this looks exactly like our expression for \frac{\partial E}{\partial \bar{w}_{ij}} derived above provided we define the error in the first layer’s outputs as:

 e_i^\mu \equiv \sum_k \bar{e}_k^\mu \bar{f}' \left( \bar{h}_k^\mu \right)  \bar{w}_{ki}

which yields the compact expression (compare with :

\frac{\partial E}{\partial w_{ij}} = \sum_{\mu} e_i^\mu f' \left( h_i^\mu \right) x_j^\mu

I will leave it up to you as an exercise (don’t you just hate it when someone says that? ;) ) to show that:

\frac{\partial E}{\partial b_j} = \sum_{\mu} e_j^\mu f' \left( h_j^\mu \right)

 Now What?

Having computed the derivatives, all we need to do is adjust our weights and biases according to some gradient descent method. In the next installment I will give a “cookbook recipe” for doing that - no need to turn any mental gears. After that we’ll discuss how to initialize the weights and biases in the first place, and conclude with a discussion of some MATLAB code that will put the concepts in this tutorial to use. Until next time - have fun!

Physics Puzzle: Rope Between Two Poles

Posted on December 10, 2007
Filed Under Physics Puzzles | 1 Comment

A good puzzle, someone once told me, is one that can be solved by a bright high-schooler, and yet challenge even a seasoned practitioner. As a puzzle-aficionado, I second that notion, which is why mechanics puzzles are my favorites. It is the most intuitive discipline and the first to be studied in both high school and university, and yet the complexity of the phenomena it describes can often be quite astounding. Here is a puzzle to test your knowledge of mechanics:

Rope Between Two Poles

A rope hangs between two poles. Can you calculate the tension in the rope at (a.) the lowest point, and (b.) at the points at which it connects to the poles? For the purposes of the question, assume the rope has some mass, m, which is uniformly distributed. Take L to be its length. You may assume you know the distance between the poles.

Note: although my graphical skills don’t necessarily portray this, the problem is symmetrical (i.e. the right and left poles are the same, and the rope is attached to both at the same height … ;) )

Neural Networks Tutorial, Part #2

Posted on December 7, 2007
Filed Under Algorithms | 1 Comment

A Short Review

In the previous part of our tutorial, I’ve spoken about feed forward neural networks (FFNN), and said that the basic building block of a FFNN is a layer, which can be concisely summarized using the following diagram:

Feed Forward Neural Network Layer

A layer has N inputs and M outputs (schematically represented here by a single line at the input and a single line at the output), and is characterized by three parameters: a weight matrix W, a bias vector b, and a transfer function, also called an input function, f. If the input of such a layer is given by the vector x = (x1, x2, …, xN), then the output, y=(y1,y2,…,yM) will be given by:

y = f(Wx + b)

There are a few input functions that are used repeatedly throughout the literature. Here are the most common 3:

Let’s Get Practical

We will be studying a two-layer FFNN. Our goal will be to “teach it” the function f(x) = x^2. We will do this by training the network on a set of inputs and outputs, and then hope it will manage to infer the function at all remaining points. Ok, beating around the bush. It’s time to introduce some notations. Here is a schematic representation of our network:

A Two-Layer Feed Forward Neural Network

I’ve introduced a few notations in this diagram. Let’s go over them.

Note thatf, \bar{f}, W, \bar{W}, b, \bar{b} are all distinct variables. Although we will eventually choose f to be a sigmoid and \bar{f} to be linear, we won’t need these assumptions to derive our conclusions below.

Training Our Network

As noted earlier,our final goal will be to train our network to mimic the function f(x) = x^2. This means that we will “feed” our network with various inputs - say, x=1, 2, 3, 4, 5, etc. - and observe the outputs. Ideally, when inputting x, we’d like to see the network output its square - e.g., when inputting 7, we’d like to get 49 as our output. Obviously, this will not happen by default. We will need to observe the actual outputs, and adjust W, b, \bar{W}, \bar{b} appropriately, until the actual outputs approximate the desired outputs closely enough. This is known as training the network.
How do we go about training our network? Fortunately for us, there is a nifty algorithm known as backpropagation which was invented for training FFNN. We will cover backpropagation in our next installment, so stay tuned!

MATLAB Speed: Using the M-Lint Utility

Posted on December 4, 2007
Filed Under matlab | Leave a Comment

A great deal of my work is done in MATLAB these days, and I come across many neat little features of the language and the IDE that are hidden among the menus. Sure, you’d know all about them if you read the manual, but who has time for that?

Here is a little trick to help you optimize your code. MATLAB comes with a utility called M-Lint that analyzes your code and tries to help you make it run faster. To enable it, in the main IDE, select the Desktop menu, -> Current Directory, to make the directory bar appear. This allows you  to browse your projects and files easily:

Matlab IDE

Once you’ve enabled the directory view, scroll to the directory containing your project (your m-files). Next, select the M-Lint report generator from a tab that’s practically hidden in the directory view (fortunately for you I’m pointing it out):

M-Lint Report Generator

Try finding that on your own! Once you select it, it automatically processes all of the m-files in the current directory and generates a report on each one:

M-Lint report

The report is quite detailed, specifying line numbers where your code might be improved and suggesting how (and why!) it should be altered. This can even teach you quite a few things you’re not even aware of!

The Pendulum and Truck Puzzle

Posted on December 2, 2007
Filed Under Physics Puzzles | 3 Comments

Still working on the next part of our neural networks tutorial, but in the meantime, a puzzle.

A stationary truck has a pendulum of length L and mass m attached to the top of its insides (see figure A below). The truck driver hits the acceleration and as a result everything sort of gets “thrown back” inside the track - pendulum included (see figure B below). The question is, can the calculate the maximal angle the pendulum reaches as it is thrown back? Neglect air friction, or any other type of friction in this problem.

Pendulum and Truck Puzzle

Enjoy!

« go backkeep looking »