Uniqueness of hyperbolic geodesics

The Poincaré model of hyperbolic geometry consists of the unit disk $\mathbb{D}=\{z\in\mathbb{C}: |z|<1\}$, where the geodesics are arcs of circles perpendicular to the unit circle $\partial\mathbb{D}=\{|z|=1\}$, together with the diameters of $\mathbb{D}$. It turns out that given two points $z,w\in\mathbb{D}$ there is a unique circle (or line, when $z$, $w$, and $0$ are collinear) orthogonal to $\partial\mathbb{D}$ passing through $z$ and $w$.

A nice way to prove the uniqueness of geodesics is to use the stereographic projection. Let $\mathcal{S}$ be the unit sphere in $\mathbb{R}^3$ and $N$ its north pole. The stereographic projection puts in one-to-one correspondence each point $z\in\mathbb{C}$ with the unique point $z^*\in \mathcal{S}\setminus\{N\}$ that lies on the line through $N$ and $z$. For example, the unit disk $\mathbb{D}$ corresponds to the southern hemisphere on the sphere $\mathcal{S}$.

One of the main properties of the stereographic projection is that it is a conformal map, i.e., it preserves angles between curves. Proving this is a nice exercise in geometry.

Another property is that the stereographic projection maps circles to “circles”, in the sense that circles on $\mathcal{S}$ through the north pole are mapped to lines in $\mathbb{C}$ and those that don’t go through $N$ are mapped to actual circles in $\mathbb{C}$.

Armed with these two facts we can now easily establish the uniqueness of hyperbolic geodesics. Given two points $z,w\in\mathbb{D}$ consider their images $z^*, w^*$ on the southern hemisphere of $\mathcal{S}$. It is clear that $z^*$ and $w^*$ do not lie on a vertical line. Therefore, there is a unique plane through $z^*$ and $w^*$ perpendicular to the $(x,y)$-plane in $\mathbb{R}^3$, i.e. $\mathbb{C}$. The intersection of this plane with $\mathcal{S}$ is a circle that is perpendicular to the equator of $\mathcal{S}$. Under the stereographic projection it will be mapped to a “circle” through the points $z,w$ and, by conformality, its image will be orthogonal to the image of the equator, i.e., $\partial\mathbb{D}$ (the equator is fixed by the projection).
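In fact, one can also solve for this circle explicitly. Orthogonality to the unit circle forces $|c|^2=1+r^2$, where $c$ is the center and $r$ the radius, and passing through $z$ then gives the linear condition $2\,\mathrm{Re}(\bar{c}z)=|z|^2+1$. Here is a minimal numerical sketch (assuming NumPy; the helper name `geodesic_circle` is my own):

```python
import numpy as np

def geodesic_circle(z, w):
    """Center and radius of the circle through z, w orthogonal to the unit circle.

    Orthogonality to |z| = 1 forces |c|^2 = 1 + r^2, and passing through z
    gives |c - z|^2 = r^2; subtracting, 2 Re(conj(c) z) = |z|^2 + 1, which is
    linear in (Re c, Im c).  The same holds for w, so the center solves a 2x2
    linear system.
    """
    A = 2 * np.array([[z.real, z.imag], [w.real, w.imag]])
    b = np.array([abs(z) ** 2 + 1, abs(w) ** 2 + 1])
    cx, cy = np.linalg.solve(A, b)
    c = complex(cx, cy)
    r = np.sqrt(abs(c) ** 2 - 1)      # radius from the orthogonality relation
    return c, r

z, w = 0.3 + 0.2j, -0.1 + 0.5j
c, r = geodesic_circle(z, w)
# The circle really passes through both points:
print(abs(abs(c - z) - r), abs(abs(c - w) - r))
```

The $2\times 2$ system is singular exactly when $z$, $w$, and $0$ are collinear, which is the case where the geodesic is a diameter rather than a circle.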

Modulus of Path Families on Graphs

These notes approximately follow a presentation Mario Bonk gave at the Workshop on Discrete and Complex Analysis, Montana State University, July 19-23, 2010. We follow the paper of Haïssinsky, “Empilements de cercles et modules combinatoires” (“Circle packings and combinatorial moduli”), Ann. Inst. Fourier (Grenoble) 59 (2009), no. 6, 2175–2222 (French).

1. Graph modulus

1.1. Path families on graphs

Let ${G=(V,E)}$ be a finite (${|V|=N}$), simple (at most one undirected edge between any two distinct vertices and no loops), connected graph, where ${V}$ is the set of vertices and ${E}$ the set of edges. It will be convenient to number the vertices and identify ${V}$ with ${\{1,2,\dots,N\}}$. Then ${E\subset {V\choose 2}}$, where ${{V\choose2}}$ is the set of unordered pairs from ${V}$.

A path ${\gamma}$ in ${G}$ is simply a connected subgraph (there are finitely many).

A ${\rho}$-metric is a function ${\rho: V\rightarrow [0,+\infty )}$ which is not identically zero. Given ${p>1}$, the energy (or mass) of ${\rho}$ is ${M_{p} (\rho)= \sum_{i=1}^{N}\rho (i)^{p}}$. Given a path ${\gamma}$, its ${\rho}$-length is ${\sum_{v\in\gamma}\rho (v)}$.

We will identify the space of all real-valued functions on the finite set ${V}$ with the space of column vectors in ${N}$ coordinates, namely ${{\mathbb R}^{N}}$. So given a ${\rho}$-metric, we will think of it as a vector ${(\rho (1),\rho (2),\dots,\rho (N))^{T}}$.

To a path ${\gamma}$ in ${G}$ we will associate its indicator function ${\mathbb{I}_\gamma}$ which equals ${1}$ at vertices in ${\gamma}$ and ${0}$ otherwise. It will be useful to think of ${\mathbb{I}_{\gamma}}$ as a row vector belonging to the dual of ${{\mathbb R}^{N}}$.

With these notations the ${\rho}$-length of ${\gamma}$ is simply the inner product

$\displaystyle \langle \rho, \mathbb{I}_{\gamma} \rangle.$

Given a path family ${\Gamma}$, we say that ${\rho}$ is admissible for ${\Gamma}$ if ${\langle \rho, \mathbb{I}_{\gamma} \rangle \geq 1}$ for every ${\gamma\in \Gamma}$. The set $\mathcal{A}$ of all admissible metrics for ${\Gamma}$ is therefore an intersection of half-spaces, hence a closed convex set:

$\displaystyle \mathcal{A} = \bigcap_{\gamma \in\Gamma}\{x\in {\mathbb R}^N: \langle x,\mathbb{I}_{\gamma}\rangle\geq 1\}$

and the indicator vectors ${\mathbb{I}_{\gamma}}$, ${\gamma\in\Gamma}$, are the inward normal directions of these half-spaces. This is a standard construction in convex analysis.

The ${p}$-modulus of ${\Gamma}$ is

$\displaystyle \mathrm{Mod}_{p} (\Gamma)= \min_{\rho\in \mathcal{A}}M_{p} (\rho)$

Metrics ${\rho}$ that attain the minimum will be called extremal metrics.

Notice that ${M_{p} (\cdot)}$ is strictly convex on ${{\mathbb R}^{N}}$ when ${p>1}$, since it’s the ${p}$-th power of the ${\ell_{p}}$ norm on ${{\mathbb R}^{N}}$. In particular this implies that extremal metrics are unique.
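Since $\mathcal{A}$ is cut out by finitely many linear inequalities and $M_p$ is strictly convex, the modulus can be computed numerically as a small convex program. Here is a sketch, assuming SciPy is available; the toy family below (two paths with vertex sets $\{0,1\}$ and $\{1,2\}$ on three vertices) is my own example, and for $p=2$ one can check by hand that the extremal metric is $\rho=(1/3,2/3,1/3)$ with $\mathrm{Mod}_2(\Gamma)=2/3$:

```python
import numpy as np
from scipy.optimize import minimize

# Toy family on V = {0, 1, 2}: two paths, with vertex sets {0, 1} and {1, 2}.
paths = [np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])]
p = 2

def mass(rho):
    # M_p(rho) = sum_i rho(i)^p
    return np.sum(np.abs(rho) ** p)

# Admissibility: <rho, I_gamma> >= 1 for every path, plus rho >= 0.
cons = [{"type": "ineq", "fun": lambda rho, I=I: I @ rho - 1} for I in paths]
res = minimize(mass, x0=np.ones(3), bounds=[(0, None)] * 3, constraints=cons)

print(res.fun)   # Mod_2 = 2/3, attained at rho = (1/3, 2/3, 1/3)
```

Strict convexity guarantees that the minimizer found this way is the unique extremal metric.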

An easy computation shows that the gradient of ${M_{p} (\rho )=\|\rho\|_{p}^{p}}$ is a vector whose ${j}$-th coordinate is ${p\rho_{j}|\rho_{j}|^{p-2}}$. For simplicity we will normalize it by dividing by ${p}$ and write

$\displaystyle \nabla M_{p} (\rho )=\rho |\rho |^{p-2} \ \ \ \ \ (1)$

where we are thinking of ${\rho}$ as a real-valued function on ${V}$, so that the product on the right-hand side is taken coordinatewise.

1.2. Beurling’s Criterion for extremal metrics

Suppose now that ${\rho}$ is an admissible metric for a path family ${\Gamma}$. Define

$\displaystyle \Gamma_{0} (\rho)=\{\gamma \in \Gamma : \langle \rho, \mathbb{I}_\gamma \rangle=1 \}. \ \ \ \ \ (2)$

Notice that if ${\rho}$ is extremal for ${\Gamma}$, then necessarily ${\Gamma_{0} (\rho)}$ is not empty. In the language of convex analysis, the paths in ${\Gamma_{0} (\rho)}$ correspond to the “supporting hyperplanes” of ${\mathcal{A}}$ at ${\rho}$.

The classical Beurling criterion in the continuous case gives a sufficient condition for a metric ${\rho}$ to be extremal. The proof carries over to the graph case unchanged.

Theorem 1 (Beurling’s Criterion)
Let ${\Gamma}$ be a path family in a finite connected graph ${G}$ as above, let ${\rho}$ be an admissible metric for ${\Gamma}$, and define ${\Gamma_{0} (\rho)}$ as in (2). Suppose that there is a subfamily ${\tilde{\Gamma}\subset\Gamma_{0} (\rho)}$ with the property that whenever ${h:V\rightarrow {\mathbb R}}$ (i.e. ${h\in {\mathbb R}^{N}}$) satisfies

$\displaystyle \langle h, \mathbb{I}_{\gamma}\rangle\geq 0 \qquad \forall \gamma \in\tilde{\Gamma}, \ \ \ \ \ (3)$

this implies that

$\displaystyle \langle h, \nabla M_{p} (\rho)\rangle\geq 0. \ \ \ \ \ (4)$

Then ${\rho}$ is extremal for ${\Gamma}$.

Proof: If ${\sigma\in \mathcal{A}}$, then ${h=\sigma -\rho}$ satisfies (3). So using the explicit form of (1),

$\begin{array}{lll} \|\rho \|_{p}^{p} & = \langle\rho , \nabla M_{p} (\rho)\rangle & \qquad \text{by } (1)\\ &&\\ & \leq \langle \sigma ,\nabla M_{p} (\rho)\rangle & \qquad \text{by } (4)\\ &&\\ & \leq \|\sigma \|_{p}\|\nabla M_{p}(\rho)\|_{p/ (p-1)} & \qquad \text{by Hölder}\\ &&\\ & = \|\sigma \|_{p}\|\rho\|_{p}^{p-1} & \qquad \text{by } (1) \end{array}$

Since ${\|\rho \|_{p}}$ is non-zero and finite (${\rho}$ is admissible), we can divide and obtain that ${M_{p} (\rho )=\|\rho\|_{p}^{p}\leq \|\sigma \|_{p}^{p}=M_{p} (\sigma)}$.

$\Box$
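One convenient way to apply the criterion in practice: if the normalized gradient ${\nabla M_{p}(\rho)=\rho|\rho|^{p-2}}$ can be written as ${\sum_{\gamma}\lambda_{\gamma}\mathbb{I}_{\gamma}}$ with ${\lambda_{\gamma}\geq 0}$ and ${\gamma\in\Gamma_{0}(\rho)}$, then (3) immediately implies (4), since ${\langle h,\nabla M_{p}(\rho)\rangle=\sum_{\gamma}\lambda_{\gamma}\langle h,\mathbb{I}_{\gamma}\rangle\geq 0}$. A numerical sketch of this check, assuming SciPy (the three-vertex family with paths $\{0,1\}$ and $\{1,2\}$ and its extremal metric are my own toy example):

```python
import numpy as np
from scipy.optimize import nnls

# Extremal metric for Gamma = { {0,1}, {1,2} } on V = {0,1,2} with p = 2.
# Both paths have rho-length exactly 1, so Gamma_0(rho) = Gamma.
rho = np.array([1 / 3, 2 / 3, 1 / 3])
grad = rho * np.abs(rho) ** 0        # normalized gradient (1): rho|rho|^{p-2}, p = 2

# Columns are the indicator vectors of the paths in Gamma_0(rho).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
lam, resid = nnls(A, grad)           # nonnegative least squares
print(lam, resid)   # lam = (1/3, 1/3), residual 0: grad lies in the positive cone
```

A zero residual with nonnegative coefficients exhibits the positive-cone representation, so Beurling's criterion applies with $\tilde{\Gamma}=\Gamma_0(\rho)$.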

1.3. A converse to Beurling’s Criterion

In the graph case one can prove a converse to Beurling’s Criterion.

Theorem 2 (A Converse to Beurling’s Criterion)
Let ${\Gamma}$ be a path family in a finite connected graph ${G}$ as above, let ${\rho}$ be an admissible metric for ${\Gamma}$, and define ${\Gamma_{0} (\rho)}$ as in (2). If ${\rho}$ is extremal for ${\Gamma}$, then whenever ${h:V\rightarrow {\mathbb R}}$ (i.e. ${h\in {\mathbb R}^{N}}$) satisfies

$\displaystyle \langle h, \mathbb{I}_{\gamma}\rangle\geq 0 \qquad \forall \gamma \in\Gamma_{0} (\rho), \ \ \ \ \ (5)$

this implies that

$\displaystyle \langle h, \nabla M_{p} (\rho)\rangle\geq 0. \ \ \ \ \ (6)$

The proof hinges on Lagrange multipliers, as we will see in the proof of the following lemma.

Lemma 3
Suppose ${G}$, ${p}$, ${\Gamma}$ are defined as above and assume that ${\rho}$ is an extremal metric for ${\Gamma}$. Define ${\Gamma_{0} (\rho)}$ as in (2).

Then ${\nabla M_{p} (\rho)}$ is contained in the positive cone generated by the normal vectors ${\{\mathbb{I}_{\gamma} \}_{\gamma\in \Gamma_{0}}}$, i.e. there are constants ${\lambda_{\gamma}\geq 0}$ for ${\gamma\in\Gamma_{0}}$ such that

$\displaystyle \nabla M_{p} (\rho)=\sum_{\gamma \in \Gamma_{0}}\lambda_{\gamma}\mathbb{I}_{\gamma}.$

Assuming Lemma 3, we can prove the converse to Beurling’s Criterion.

Proof of Theorem 2:
Suppose ${\rho}$ is extremal for ${\Gamma}$. Define ${\Gamma_{0} (\rho)}$ as in (2) and assume that ${h:V\rightarrow {\mathbb R}}$ satisfies (5). Then

$\begin{array}{lll} \langle h, \nabla M_{p} (\rho)\rangle & = \sum_{v\in V}h (v)\sum_{\gamma \in \Gamma_{0} (\rho)}\lambda_{\gamma}\mathbb{I}_{\gamma} (v) & \qquad\text{by Lemma 3}\\ &&\\ &= \sum_{\gamma \in \Gamma_{0} (\rho)}\lambda_{\gamma}\sum_{v\in V} h (v)\mathbb{I}_{\gamma} (v)\geq 0 & \qquad\text{by (5)} \end{array}$

$\Box$

Proof of Lemma 3:
Assume first that ${\Gamma \setminus \Gamma_{0}}$ is not empty (if it is empty, the argument below goes through ignoring every mention of ${\delta}$). Then

$\displaystyle \delta= \min_{\gamma \in \Gamma \setminus \Gamma_{0}} \langle \rho , \mathbb{I}_{\gamma} \rangle - 1 > 0$

because ${\Gamma}$ is finite.

Also let

$\displaystyle \Delta=\bigcap_{\gamma\in \Gamma_{0}}\{x\in \mathbb{R}^{N}: \langle x, \mathbb{I}_{\gamma} \rangle= 1\},$

i.e., the intersection of the so-called supporting planes at ${\rho}$. Clearly ${\rho \in \Delta}$ by construction, and ${M_{p}}$ restricted to ${\Delta}$ admits a unique minimizer (by strict convexity), which we will call ${\rho_{1}}$.

Claim 1
${\rho_{1}=\rho}$.

Proof of Claim 1:
A priori we do not know how ${\rho_{1}}$ behaves on paths ${\gamma \in \Gamma\setminus \Gamma_{0}}$. So consider the convex combinations ${\rho_{t}=t\rho_{1}+ (1-t)\rho}$ for ${0\leq t\leq 1}$. Since both ${\rho}$ and ${\rho_{1}}$ are in ${\Delta}$ and ${\Delta}$ is convex, ${\rho_{t}\in\Delta}$ for all ${t}$. For ${\gamma \in \Gamma \setminus \Gamma_{0}}$,

$\displaystyle \langle \rho_{t} , \mathbb{I}_{\gamma} \rangle \geq t \langle \rho_{1} , \mathbb{I}_{\gamma} \rangle+ (1-t) (1+\delta) =1+\delta +O (t)$

as ${t\rightarrow 0}$. Therefore, for ${t}$ small ${\rho_{t}}$ is an admissible metric for ${\Gamma}$. On the other hand, if ${\rho \neq\rho_{1}}$ then by strict convexity ${M_{p} (\rho_{t})<M_{p} (\rho)}$ for all ${0< t\leq 1}$. This contradicts the fact that ${\rho}$ is extremal for ${\Gamma}$. So ${\rho=\rho_{1}}$ and the Claim is proved.

$\Box$

We now know that ${M_{p}}$ attains its minimum value over ${\Delta}$ at ${\rho}$. By Lagrange multipliers, ${\nabla M_{p} (\rho)}$ must be orthogonal to the affine space ${\Delta}$, i.e., ${\nabla M_{p} (\rho)}$ must be in the span ${S}$ of ${\{\mathbb{I}_{\gamma} \}_{\gamma \in \Gamma_{0}}}$. We want to show that ${\nabla M_{p} (\rho)}$ is actually in the positive cone spanned by the ${\mathbb{I}_{\gamma}}$'s.

Let ${F=\{x\in S: \langle x, \nabla M_{p} (\rho )\rangle= 0 \}}$ be the orthogonal complement of ${\nabla M_{p} (\rho)}$ in ${S}$. Note that, since ${\rho}$ is admissible,

$\displaystyle \langle \nabla M_{p} (\rho ), \mathbb{I}_{\gamma} \rangle= \sum_{v\in\gamma}\rho (v)^{p-1}>0. \ \ \ \ \ (7)$

So the angle between ${\mathbb{I}_{\gamma}}$ and ${\nabla M_{p} (\rho)}$ is strictly less than ${\pi /2}$.

Project ${\nabla M_{p} (\rho)}$ onto ${F}$ along the directions ${\mathbb{I}_{\gamma}}$, i.e. find points ${p_{\gamma}\in F}$ and scalars ${a_{\gamma}\in {\mathbb R}}$ such that

$\displaystyle p_{\gamma}+a_{\gamma}\mathbb{I}_{\gamma}=\nabla M_{p} (\rho) \ \ \ \ \ (8)$

Taking the inner product with ${\nabla M_{p} (\rho)}$ and using (7) one finds that ${a_{\gamma}>0}$ for all ${\gamma \in \Gamma_{0}}$.

Notice that ${0}$ is in the convex hull of ${\{p_{\gamma}\}_{\gamma \in \Gamma_{0}}}$ if and only if ${\nabla M_{p} (\rho)}$ is in the positive cone generated by ${\{\mathbb{I}_{\gamma} \}_{\gamma \in \Gamma_{0}}}$. In particular, if ${0}$ is not in the convex hull, then we can find ${\sigma \in F}$ so that

$\displaystyle \langle -\sigma ,p_{\gamma}\rangle > 0\qquad \forall \gamma \in \Gamma_{0}.$

Taking the inner product of (8) with ${\sigma}$ and recalling that ${\sigma\in F}$, so that ${\langle \sigma ,\nabla M_{p} (\rho)\rangle=0}$, we get ${a_{\gamma}\langle \sigma ,\mathbb{I}_{\gamma}\rangle=\langle -\sigma ,p_{\gamma}\rangle>0}$. Since ${\Gamma_{0}}$ is finite, there is therefore ${\eta >0}$ such that

$\displaystyle \langle \sigma ,\mathbb{I}_{\gamma}\rangle > \eta >0 \qquad \forall \gamma \in \Gamma_{0}.$

Let ${\sigma_{t}= \rho +t\sigma}$, for ${t\geq 0}$. If ${\gamma \in \Gamma_{0}}$, then

$\displaystyle \langle \sigma_{t} ,\mathbb{I}_{\gamma}\rangle= \langle \rho ,\mathbb{I}_{\gamma}\rangle+t\langle \sigma ,\mathbb{I}_{\gamma}\rangle \geq 1+t\eta.$

While if ${\gamma \in \Gamma \setminus \Gamma_{0}}$, then

$\displaystyle \langle \sigma_{t} ,\mathbb{I}_{\gamma}\rangle \geq (1+\delta)+t\langle \sigma ,\mathbb{I}_{\gamma}\rangle \geq 1+t\eta$

for ${t}$ sufficiently small.

Given a node ${v\in V}$ such that ${\rho (v)>0}$, we have ${\sigma_{t} (v)>0}$ for ${t}$ small. Also if ${\rho (v)=0}$, then ${|\sigma_{t} (v)|^{p}=o(t)}$. With this let’s calculate the mass of ${\sigma_{t}}$.

$\displaystyle \begin{array}{rcl} M_{p} (\sigma_{t}) & = & \sum_{v\in V}|\rho (v)+t\sigma (v)|^{p}\\ &&\\ & = & \sum_{v: \rho (v)>0}\rho (v)^{p}\left (1+t\frac{\sigma (v)}{\rho (v)}\right)^{p}+o (t)\\ &&\\ & = & \sum_{v: \rho (v)>0}\rho (v)^{p}\left ( 1+tp\frac{\sigma (v)}{\rho (v)}\right)+o (t)\\ &&\\ & = & M_{p} (\rho)+t\sum_{v: \rho (v)>0}p\rho (v)^{p-1}\sigma (v)+o (t)\\ &&\\ & = & M_{p} (\rho)+tp\langle \sigma ,\nabla M_{p} (\rho)\rangle +o (t)=M_{p} (\rho)+o (t) \end{array}$

Thus for ${t}$ small, the metric ${\sigma_{t}/ (1+\eta t)}$ is admissible and has mass ${M_{p} (\sigma_{t})/ (1+\eta t)^{p}=M_{p} (\rho)-p\eta M_{p} (\rho)t+o (t)}$, which is strictly smaller than ${M_{p} (\rho)}$ for ${t}$ small. This contradicts the extremality of ${\rho}$.

$\Box$

An elementary introduction to data and statistics

The first part of this material should be accessible to a fourth-grader, the latter part to a middle-schooler. The initial Mental Math trick can be taught without algebra, even though in order to describe the method on paper we found it useful to introduce some notation.

For us a distribution, or simply the data, is a finite string of numbers, such as

$\displaystyle \{49,50,52,53,54\}$

These numbers could represent the weights in kilograms of five kids, or the monetary amounts in the pockets of five friends. The specific context only comes into play when interpreting the statistics. Statistics are computations that reveal some information about the data, but also “forget” about most of the distinguishing features of a given distribution.

For instance, the mean, or average, of the distribution is computed by adding up all the data into a sum

$\displaystyle S=\mbox{ the sum of the data}$

and then dividing it equally by the number of data points.

Some algebra helps clarify the phrasing here. The first piece of data is usually represented by the letter ${x_1}$, the second by ${x_2}$, etc., and the last one by ${x_n}$. This makes it clear that we are dealing with ${n}$ numbers. For instance, if we are dealing with the numbers ${\{49,50,52,53,54\}}$ then we would think of ${x_1=49, x_2=50,x_3=52,x_4=53,x_5=54}$. The number of data points ${n}$ here is equal to ${5}$ and the sum ${S}$ is… well, something we would rather not compute directly. The average is usually written as ${\bar{x}}$ and in this case it would be computed by dividing ${S}$ by ${5}$.

The task of adding up all the data seems daunting at first, but here is a trick that allows you to do the calculation “mentally”.

• First make a guess. If you haven’t learned about negative numbers, then always make your guess to be the smallest datum. In our example, your guess would then be ${49}$. The notation for the guess is ${y}$.
• Now subtract off your initial guess from each of the data and get a new set of data that will be easier to manage. In our example with guess ${y=49}$, the new distribution is ${\{0,1,3,4,5\}}$. We give a name to these new data, we will call them the errors and write ${e_1,\dots, e_5}$.
• Now we average the errors. In our example it is much easier now to add up the errors and get ${0+1+3+4+5=13}$. Even though ${e_1}$ is zero it still counts as a piece of data, so we still must divide by ${5}$. Hence we get that the errors average to ${13/5}$, or ${2\frac{3}{5}}$.
• Finally, add the averaged errors to your initial guess, and voilà! For us:

$\displaystyle 49+2\frac{3}{5}=51\frac{3}{5}.$

You can practice this strategy on a few more examples and then you will be able to impress friends and family with amazing mental skills. Try and ask your mom to give you four numbers between ${1350}$ and ${1400}$. She’ll say something like: ${1357,1366,1381,1372}$. Give her a calculator and ask her to average these numbers (she should be able to do that) but tell her not to tell you the answer. Now guess a number roughly in between, say ${1370}$. Subtract it off and get ${-13,-4,11,2}$. These add up to ${-4}$. What luck! ${-4\div 4=-1}$. So the average is ${1370-1=1369}$!
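The trick can also be sketched in a few lines of code (a toy illustration; the function name `mean_with_guess` is my own):

```python
def mean_with_guess(data, guess):
    """Average via the mental-math trick: guess, average the errors, add back."""
    errors = [x - guess for x in data]
    return guess + sum(errors) / len(errors)

print(mean_with_guess([49, 50, 52, 53, 54], 49))       # 51.6, i.e. 51 3/5
print(mean_with_guess([1357, 1366, 1381, 1372], 1370)) # 1369.0
```

Any guess gives the same answer, which is exactly what the algebra below explains.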

Why does this trick work? Why does it work no matter what your initial guess is? The best way to explain this is using some algebra. Luckily we’ve already set up all the necessary notation. Suppose you want to average data ${x_1,\dots,x_n}$. Make a guess ${y}$. Subtract it off and get a new data set of errors ${e_1=x_1-y,\dots,e_n=x_n-y}$. Now average these errors:

$\displaystyle \bar{e}=\frac{(x_1-y)+\cdots+(x_n-y)}{n}$

Getting rid of the parentheses and rearranging, we find that

$\displaystyle \bar{e}=\frac{(x_1+\cdots+x_n)-ny}{n}=\frac{x_1+\cdots+x_n}{n}-\frac{ny}{n}=\bar{x}-y.$

So when we add ${\bar{e}}$ to our initial guess ${y}$, we see that ${y}$ cancels out:

$\displaystyle \bar{e}+y=(\bar{x}-y)+y=\bar{x}.$

Is it possible to guess the average right the first time? Yes of course. In that case ${y=\bar{x}}$ and when you go and average the errors you find that ${\bar{e}=\bar{x}-y=0}$. In fact this property characterizes the mean:

${\bar{x}}$ is the only number that makes all the (signed) errors add up to ${0}$.

This property of ${\bar{x}}$ explains the physical intuition that is often given for the mean. Think of ${49,50,52,53,54}$ as the places along the number line where unit weights are lying on a thin tray. Then try to place a wedge (‘fulcrum’) under the tray pointing at some point with coordinate ${y}$. The tray will balance only when all the signed distances to ${y}$ add up to zero, namely when ${y=\bar{x}}$. Otherwise, the tray will crash to the floor.

Let’s go back to the interpretation of the mean in specific examples. When ${49,50,52,53,54}$ represent amounts of money, then the mean ${51\frac{3}{5}}$ (${51}$ dollars and ${60}$ cents) represents what everyone would end up with if we tried to redistribute the money, “level the playing field”, in such a way that everyone has the same amount. That amount is the mean. Clearly this interpretation fails if ${49,50,52,53,54}$ represented heights instead. In no way could we redistribute heights.

So in applications the interpretation of the mean must vary from context to context, and sometimes the information that is lost from the data when computing the mean might overshadow whatever “statistic” is obtained. Unfortunately, in politics and in the social sciences, too often the error is made of speaking as if the mean represented everything one would want to know about a specific data set.

Of course statisticians have a partial answer to this problem. If it’s true that two quite different data sets may share the same average (thus losing all the information that distinguishes the two data sets), we can come up with a way of measuring how “dispersed” a data set may be around its average. This is a “second order” analysis. Let’s consider again our friends, the errors ${e_1=x_1-\bar{x},\dots, e_n=x_n-\bar{x}}$. We know that they add up to zero, but if we remove the signs and consider instead ${|e_1|,\dots,|e_n|}$, i.e., the distances of each piece of data from the average, how do they behave? What is their average? In words that would be “the average distance from the average”. There is a term for this quantity, it’s called the mean deviation:

$\displaystyle {\rm MD}=\frac{|x_1-\bar{x}|+\cdots+|x_n-\bar{x}|}{n}.$

Mathematicians, it turns out, are not satisfied with this. Instead of just removing the sign of the errors ${e_1,\dots,e_n}$, we’d rather do it simply by “squaring” the errors. So instead of the mean deviation, we prefer to compute the variance:

$\displaystyle {\rm Var}=\frac{(x_1-\bar{x})^2+\cdots+(x_n-\bar{x})^2}{n}$

and then, to make amends, we take the square root of the variance and call that the standard deviation.

In words, the standard deviation is “the square root of the average square-distance from the average”. Why on earth would one want to square errors? There are many deep reasons for this and appealing to a vague resemblance to the Pythagorean Theorem would go a long way in explaining this. Instead let me give you an idea of why the variance is better, by doing a simple calculation.

What happens if we make an initial guess ${y}$ which turns out not to be the right guess: ${y\neq \bar{x}}$, and then we happily go ahead and start averaging the square distances to ${y}$ instead? What can we say about the number we would end up computing in relation to the variance? It turns out, that no matter what our initial guess is, we would always get something larger than the variance. In other words, we get another characterization of the mean:

${\bar{x}}$ is the unique value of ${y}$ that minimizes the sum of the square errors ${(x_1-y)^2+\cdots+(x_n- y)^2}$.

To see this, let’s focus on one term of the sum at a time, say the first one. We want to compare ${(x_1-y)^2}$ to ${(x_1-\bar{x})^2}$. Let’s take the difference! Then we can maybe use the remarkable identity

$\displaystyle a^2-b^2=(a-b)(a+b).$

This can be checked simply by unfolding the right hand-side.

We get

$\begin{array}{ll} (x_1-y)^2-(x_1-\bar{x})^2 & =[(x_1-y)-(x_1-\bar{x})][(x_1-y)+(x_1-\bar{x})]\\ & \\ & = [\bar{x}-y][2x_1-y-\bar{x}] \end{array}$

The same exact computation holds with ${x_1}$ replaced by any other ${x_j}$. So adding these identities up and factoring out the common term ${[\bar{x}-y]}$, we get

$\begin{array}{l} ((x_1-y)^2+\cdots+(x_n-y)^2)-((x_1-\bar{x})^2+\cdots+(x_n-\bar{x})^2)\\ \\ = [(x_1-y)^2-(x_1-\bar{x})^2]+\cdots+[(x_n-y)^2-(x_n-\bar{x})^2]\\ \\ = [\bar{x}-y][2x_1-y-\bar{x}]+\cdots+ [\bar{x}-y][2x_n-y-\bar{x}]\\ \\ = [\bar{x}-y][2(x_1+\cdots+x_n)-ny-n\bar{x}]\\ \\ = [\bar{x}-y][2n\bar{x}-ny-n\bar{x}]\\ \\ = n(\bar{x}-y)^2>0 \end{array}$

where I used the fact that ${x_1+\cdots+x_n=n\bar{x}}$.

What this shows is that if we go ahead and compute the average square error having made a guess ${y}$, we always get a larger quantity than the variance and in fact we overshoot exactly by ${(\bar{x}-y)^2}$. The magic of squares!
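The overshoot identity is easy to test numerically (a small sketch using the data from earlier; the helper name `sse` is my own):

```python
data = [49, 50, 52, 53, 54]
n = len(data)
xbar = sum(data) / n

def sse(y):
    """Sum of squared errors around a guess y."""
    return sum((x - y) ** 2 for x in data)

for y in [40.0, 51.0, 60.0]:
    # The overshoot is exactly n * (xbar - y)^2, as derived above.
    print(sse(y) - sse(xbar), n * (xbar - y) ** 2)
```

Whatever the guess $y$, the two printed numbers agree, so the mean really is the unique minimizer of the sum of squared errors.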


On the Euclidean Growth of Entire Functions

In Mapping properties of analytic functions on the unit disk, Proceedings of the American Mathematical Society, Vol. 135, N. 9 (2007), 2893-2898, I show that there is a universal constant $0<r_0<1$ such that whenever $f$ is analytic in the unit disk $\mathbb{D}$ and whenever a disk $D$ centered at $f(0)$ has the property that the Euclidean area, counting multiplicity, of $f$ over $D$ is strictly less than the Euclidean area of $D$, then $f$ must necessarily send the smaller disk $r_0\mathbb{D}$ into $D$. In formulas,

$\displaystyle \int_{\{z\in\mathbb{D}: f(z)\in D\}}|f^{\prime}(z)|^2 dA(z)< A(D) \Longrightarrow f(r_0\mathbb{D})\subset D.$

At the time I did not think of rephrasing this for entire functions. It goes like this. Suppose $g$ is an entire function (say with $g(0)=0$). We measure the growth of $g$ in two different ways. Given a radius $r>0$, one quantity we can measure is the maximum modulus

$\displaystyle M(r)=\max_{|z|=r}|g(z)|.$

The other quantity we will be concerned with is $E(r)$, the Euclidean area, counting multiplicity, covered by $g$ when restricted to the disk $\{|z|<r\}$, i.e.,

$\displaystyle E(r)=\int_{|z|<r}|g^{\prime}(z)|^{2}\,dA(z).$

The claim then is that, with these notations and conditions, there is an absolute constant $\theta_0>1$ so that

$\displaystyle M(r)\leq \sqrt{E(\theta_0 r)/\pi}.$

This follows from the result in the unit disk mentioned above by picking $\theta_0>1/r_0$. Indeed, suppose that for some radius $r>0$, we have $M(r)>\sqrt{E(\theta_0 r)/\pi}$. Then, $E(\theta_0 r)<\pi M(r)^2$. Now apply the result mentioned above to the function $f(\zeta)=g(\theta_0 r\zeta)$ and to the image disk $D=\{|w|<M(r)\}$. The Euclidean area over $D$ covered by $f$ is less than the total Euclidean area covered by $f$, which is $E(\theta_0 r)$, and is hence strictly less than $A(D)$. So we find that $g$ must necessarily send the disk $\{|z|<r_0\theta_0 r\}$ into $\{|w|<M(r)\}$, but this contradicts the maximum principle because we chose $r_0\theta_0>1$.
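To illustrate the two quantities, here is a rough numerical sketch (my own example, assuming NumPy) that approximates $E(r)$ in polar coordinates for $g(z)=z^2$, where $g'(z)=2z$ and one can compute $E(r)=2\pi r^4$ exactly; the universal constant $\theta_0$ itself is not computed here:

```python
import numpy as np

def E(r, g_prime, n=400):
    """Euclidean area covered on |z| < r, counting multiplicity:
    E(r) = integral over |z| < r of |g'(z)|^2 dA(z), via a polar Riemann sum."""
    s = np.linspace(0, r, n + 1)[1:]                 # radial nodes
    t = np.linspace(0, 2 * np.pi, n, endpoint=False) # angular nodes
    S, T = np.meshgrid(s, t)
    Z = S * np.exp(1j * T)
    vals = np.abs(g_prime(Z)) ** 2 * S               # |g'|^2 times the polar Jacobian
    return vals.sum() * (r / n) * (2 * np.pi / n)

# g(z) = z^2, so g'(z) = 2z and, exactly, E(r) = 2*pi*r^4.
r = 1.5
approx = E(r, lambda z: 2 * z)
print(approx, 2 * np.pi * r ** 4)
```

For $g(z)=z^2$ we also have $M(r)=r^2$, so the inequality $M(r)\leq\sqrt{E(\theta_0 r)/\pi}=\sqrt{2}\,\theta_0^2 r^2$ visibly holds for any $\theta_0>1$.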

This post is listed under “Curiosity”, because the classical and much more important way to measure the growth of an entire function is to use the Spherical area in the image. Namely,

$\displaystyle S(r)=\int_{|z|<r}\frac{|g^{\prime}(z)|^{2}}{(1+|g(z)|^{2})^{2}}\,dA(z).$

Moreover, $S(r)$ is related to the growth of $\log M(r)$ instead of $M(r)$.

Square-roots of Complex Numbers

If $z=re^{i\theta}$ is a complex number in polar coordinates, with say $0<\theta<\pi$, then one of its square-roots is $w=\sqrt{r}e^{i\theta/2}$. But what if one wants to avoid using the exponential function?

This trick was related to me by my colleague Bob Burckel.

Draw the parallelogram generated by $0,z,|z|$ and draw its diagonal through $0$. In complex notation that’s simply $z+|z|$. Since the sides from $0$ to $z$ and from $0$ to $|z|$ have the same length, the parallelogram is in fact a rhombus, so its diagonal bisects the angle at $0$: the angle that $z+|z|$ forms with the positive $x$-axis is half the angle that $z$ forms. So it will be enough to renormalize by dividing by $|z+|z||$ and multiplying by $\sqrt{|z|}$ and get

$\displaystyle w=\sqrt{|z|}\frac{z+|z|}{|z+|z||}.$

One can then check computationally that $w^2=z$. Try it, it’s actually not entirely straightforward.
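The check can also be done numerically (a quick sketch; the helper name `sqrt_no_exp` is my own):

```python
import random

def sqrt_no_exp(z):
    """Square root of z via the rhombus trick: w = sqrt(|z|) (z + |z|) / |z + |z||."""
    d = z + abs(z)
    return (abs(z) ** 0.5) * d / abs(d)

random.seed(0)
for _ in range(100):
    # any z with 0 < arg z < pi (the trick in fact works off the negative real axis)
    z = complex(random.uniform(-5, 5), random.uniform(0.1, 5))
    w = sqrt_no_exp(z)
    assert abs(w * w - z) < 1e-12 * max(1, abs(z))
print("w^2 = z for all samples")
```

The only problematic inputs are the negative reals, where $z+|z|=0$ and the formula degenerates, consistent with the branch cut of the square root.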

Complex Numbers

A complex number can be written in Cartesian coordinates as $z=x+iy$, where $x$ is the real part and $y$ is the imaginary part. Or it can be written in polar coordinates as $z=re^{i\theta}$, where $r$ is the absolute value and $e^{i\theta}$ is the point on the unit circle corresponding to the angle $\theta$ measured in the usual anti-clockwise direction from the positive $x$-semiaxis. The angle $\theta$ is called the argument, but unfortunately it is only defined up to multiples of $2\pi$. The unimodular number $e^{i\theta}$ could be referred to as the direction of $z$.

The complex conjugate of $z$ is the complex number $\bar{z}=x-iy=r/e^{i\theta}$.

Then the Cartesian decomposition correspond to the following additive trick:

$\displaystyle 2z=(z+\bar{z})+(z-\bar{z})$

While the polar decomposition corresponds to a multiplicative trick:

$\displaystyle z^2=(z\cdot\bar{z})(z/\bar{z})$

By this I mean that $z+\bar{z}$ is twice the real part of $z$, while $z-\bar{z}$ is $2i$ times the imaginary part of $z$. And on the other hand, $z\cdot\bar{z}$ is the square of the absolute value of $z$, and $z/\bar{z}$ is the square of the direction of $z$.
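Both tricks are easy to verify numerically (a small sketch using Python's built-in complex type):

```python
import cmath

z = 3 + 4j
r, theta = abs(z), cmath.phase(z)
zbar = z.conjugate()

# Additive trick: z + zbar = 2 Re z and z - zbar = 2i Im z.
print((z + zbar) / 2, (z - zbar) / (2j))   # 3 and 4, as complex numbers

# Multiplicative trick: z*zbar = |z|^2, and z/zbar = (direction)^2 = e^{2 i theta}.
print(z * zbar, z / zbar, cmath.exp(2j * theta))
```

Note that `z / zbar` and `cmath.exp(2j * theta)` agree, which is exactly the statement that $z/\bar{z}$ is the square of the direction of $z$.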