Let $x$ and $y$ both be uniform on $[0,1]$. Let $w$ be the smaller of the two, and let $z$ be the larger. Let $h = \sqrt{x^2+y^2}$. So $(x,y)$ is a random point in the unit square, and $h$ is its distance from the origin. We can predict this distance using linear regression. For example, in R, we can pick $10^4$ such points and execute the code
x=runif(10^4,0,1)
y=runif(10^4,0,1)
w=pmin(x,y)  # smaller coordinate of each point
z=pmax(x,y)  # larger coordinate of each point
h=sqrt(w^2+z^2)
lm(h~0+w+z)
to fit a linear model of the form $h = aw + bz$. The least-squares model here is, for this particular simulation, $h = 0.4278w + 0.9339z$, with $R^2$ very close to 1. In other words, the formula $h \approx 0.4278 \min(x,y) + 0.9339 \max(x,y)$ appears to predict $h$ as a linear function of $\min(x,y)$ and $\max(x,y)$ quite well, and so the hypotenuse of a right triangle is 0.4278 times its shorter leg, plus 0.9339 times its longer leg. For a particular famous special case, try x = 3, y = 4; then we predict the hypotenuse is 0.4278(3) + 0.9339(4) = 5.019, quite close to the true value of 5.
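As a quick check beyond the 3-4-5 case, here is a small sketch that applies the fitted coefficients quoted above to a few other Pythagorean triples. The choice of triples and the variable names are illustrative assumptions, not part of the original simulation.

```r
# Coefficients from the simulated fit quoted in the text.
a <- 0.4278
b <- 0.9339

# A few Pythagorean triples, given as (leg, leg); chosen only for illustration.
legs <- rbind(c(3, 4), c(5, 12), c(8, 15), c(20, 21))
shorter <- pmin(legs[, 1], legs[, 2])
longer  <- pmax(legs[, 1], legs[, 2])

predicted <- a * shorter + b * longer      # linear "Pythagoras"
actual    <- sqrt(shorter^2 + longer^2)    # true hypotenuse
rel_error <- (predicted - actual) / actual # signed relative error

print(data.frame(shorter, longer, actual, predicted, rel_error))
```

For these four triples the relative error stays under about 3 percent in absolute value, and the 3-4-5 triangle turns out to be an unusually favorable case.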
Andrew Gelman and Deborah Nolan, in Teaching Statistics: A Bag of Tricks, give a very similar example, with slightly different numerical parameters, and quip that “if Pythagoras knew about multiple regression, he might never have discovered his famous theorem” (p. 146). They fit a model that is allowed to have a nonzero constant term; I choose to fit a model with zero constant term. I think that our anachronistic Pythagoras would have had the sense to observe that if we double x and y, we should double the hypotenuse as well.
The natural question, to me, is to determine the “true” constants. So what constants $a$ and $b$ give the linear function $ax+by$ that best approximates $\sqrt{x^2+y^2}$, when we restrict to $0 \le x \le y \le 1$? The reason for the triangular-shaped region is that we’re restricting to the case where $x$ is smaller and $y$ is larger. To be consistent with our Pythagoras-as-linear-regressor model, we’ll make the approximation in the least-squares sense. So we want to minimize
$f(a,b) = \int_0^1 \int_0^y \left( \sqrt{x^2+y^2} - (ax+by) \right)^2 \: dx \: dy$
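as a function of $a$ and $b$. This is a calculus problem. Expand the integrand to get
$x^2 + y^2 - 2ax\sqrt{x^2+y^2} - 2by\sqrt{x^2+y^2} + a^2x^2 + 2abxy + b^2y^2.$
The polynomials are easy to integrate; the square-root terms somewhat less so, if it’s been a while since you’ve done freshman calculus. But after a bit of work this is
$f(a,b) = \frac{1}{3} - \frac{2\sqrt{2}-1}{6} a - \frac{\sqrt{2}+L}{4} b + \frac{a^2}{12} + \frac{ab}{4} + \frac{b^2}{4}$
where $L = \log(1+\sqrt{2})$. Differentiating we get
$\frac{\partial f}{\partial a} = \frac{a}{6} + \frac{b}{4} - \frac{2\sqrt{2}-1}{6}$
and
$\frac{\partial f}{\partial b} = \frac{a}{4} + \frac{b}{2} - \frac{\sqrt{2}+L}{4}$.
Set both of these equal to zero and solve to get
$a = 5\sqrt{2} - 4 - 3L \approx 0.4269, \quad b = 2 - 2\sqrt{2} + 2L \approx 0.9343,$
which are tolerably close to the coefficients that came out of the regression. (Those coefficients had standard errors of 0.0009 and 0.0005 respectively.)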
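As a numerical cross-check on the calculus, the sketch below sets up the two normal equations for the least-squares problem on the triangle $0 \le x \le y \le 1$ and solves them using R's `integrate` and `solve`. The helper `tri_int` is hypothetical, written just for this sketch, and the closed-form expressions at the end are an assumption to be tested against the numerical answer rather than something the numerical solve relies on.

```r
# Integrate g(x, y) over the triangle 0 <= x <= y <= 1 with nested integrate().
# tri_int is a hypothetical helper written for this sketch.
tri_int <- function(g) {
  inner <- function(ys) {
    sapply(ys, function(y) {
      # rep() keeps the result the same length as x even if g ignores x
      integrate(function(x) g(x, rep(y, length(x))), 0, y)$value
    })
  }
  integrate(inner, 0, 1)$value
}

h <- function(x, y) sqrt(x^2 + y^2)

# Normal equations M %*% c(a, b) = rhs for minimizing the squared error.
M <- matrix(c(tri_int(function(x, y) x^2),   tri_int(function(x, y) x * y),
              tri_int(function(x, y) x * y), tri_int(function(x, y) y^2)),
            nrow = 2, byrow = TRUE)
rhs <- c(tri_int(function(x, y) x * h(x, y)),
         tri_int(function(x, y) y * h(x, y)))

coef_num <- solve(M, rhs)

# Closed forms assumed for comparison, with L = log(1 + sqrt(2)).
L <- log(1 + sqrt(2))
coef_closed <- c(5 * sqrt(2) - 4 - 3 * L, 2 - 2 * sqrt(2) + 2 * L)

print(rbind(numerical = coef_num, closed_form = coef_closed))
```

Solving the normal equations directly avoids having to minimize $f$ by search: since $f$ is quadratic in $a$ and $b$, its minimum is the solution of a 2-by-2 linear system.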
Of course our hypothetical Pythagoras couldn’t have done these integrals, and would not have liked that they turn out to be irrational. Perhaps he would have just said that the length of the hypotenuse of a triangle was three-sevenths of the shorter leg, plus fourteen-fifteenths of the longer leg.
Very nice. This reminds me of an old shortcut that I read while doing graphics work on a 386: sqrt( x^2 + y^2 ) is approximately 1/2 min(x,y) + max(x,y). Translated into your above equations, this is a = 1/2, b = 1.
This is very easy to pull off when your x and y are integers or fixed-point numbers. I suppose you could pull off fixed-point versions of your coefficients, too.
One advantage that a = 1/2, b = 1 has over a = 0.4269, b = 0.9343, though, is that axis-aligned distances are exact in the former and underestimated in the latter.
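A rough way to see this trade-off is to tabulate the relative error of both approximations as a function of the ratio t = min(x,y)/max(x,y), which is enough because both formulas and the true distance scale linearly. This is only a sketch; the grid spacing and variable names are arbitrary choices, and the coefficients are the ones quoted in this comment.

```r
# t is the ratio shorter/longer; t = 0 is the axis-aligned case, t = 1 the diagonal.
t <- seq(0, 1, by = 0.001)
exact <- sqrt(1 + t^2)

shortcut <- 0.5 * t + 1            # a = 1/2, b = 1
ls_fit   <- 0.4269 * t + 0.9343    # least-squares coefficients

rel_shortcut <- shortcut / exact - 1
rel_ls       <- ls_fit / exact - 1

summary_table <- data.frame(
  method         = c("shortcut", "least squares"),
  at_axis        = c(rel_shortcut[1], rel_ls[1]),
  at_diagonal    = c(rel_shortcut[length(t)], rel_ls[length(t)]),
  worst_absolute = c(max(abs(rel_shortcut)), max(abs(rel_ls)))
)
print(summary_table)
```

On this grid the shortcut is exact on the axes and never underestimates, but its worst case is an overestimate of roughly 12 percent, while the least-squares coefficients keep the worst case under 7 percent at the cost of underestimating axis-aligned distances.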
[...] Lugo has upgraded the platform on his blog and is kind of on fire! I really liked this post about what Pythagoras would have done if he’d had access to linear regression, and this one, mathematically modeling the Lake Wobegon [...]
I tried computing the exact value of the coefficient of determination. I might be wrong, but I get an expression that is approximately equal to 0.995607, which has one fewer 9 than reported in the original post.
I tried a similar computation when you consider the hypotenuse as a function only of the longer side. Then you get r^2 of 90% (correlation is ~95%).
As a function of only the shorter side, you get r^2 of about 55% (correlation is ~74%). How does that compare with results in a science other than physics?
(I computed all of these values exactly, but they’re not pretty …)
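These figures can be checked approximately by simulation. The sketch below is one possible reading of the comparison, not the commenter's own calculation: it assumes the single-predictor values are squared correlations, and it computes the two-predictor value both relative to the mean of h and in the form that R's `summary()` reports for a model without an intercept, which measures variation about zero instead. The seed and sample size are arbitrary.

```r
set.seed(1)
n <- 1e5
x <- runif(n)
y <- runif(n)
w <- pmin(x, y)            # shorter leg
z <- pmax(x, y)            # longer leg
h <- sqrt(w^2 + z^2)       # hypotenuse

# Single-predictor cases, read here as squared correlations.
r2_longer  <- cor(h, z)^2
r2_shorter <- cor(h, w)^2

# Two-predictor fit with zero constant term, as in the post.
fit <- lm(h ~ 0 + w + z)
rss <- sum(resid(fit)^2)
r2_centered   <- 1 - rss / sum((h - mean(h))^2)  # variation about the mean
r2_uncentered <- summary(fit)$r.squared          # variation about zero (no intercept)

print(c(longer = r2_longer, shorter = r2_shorter,
        centered = r2_centered, uncentered = r2_uncentered))
```

If this reading is right, the missing 9 may simply reflect the two conventions: the centered value should land near 0.9956, while the uncentered value that R prints for a no-intercept model sits above 0.999. Exact digits depend on the seed.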
[...] Lugo wrote a great post, following an idea of Andrew Gelman, about what would have happened if Pythagoras had known linear [...]