Wednesday, April 30, 2014

regularization


"generalizing from finite samples in extremely high dimensions" is an amazing problem. think of life expectancy. let's say men born today expect to live 82 years and women expect to live 84 years (don't know the actual numbers). that's based on a lot of samples of men and women (plus some projections, but let's say for now it was just based on raw statistics of past events). OK. so if you're a man, your best guess for how long you'll live is 82 years, and if a woman, your best guess is 84 years. in other words, if you were doing optimal decision-making or whatever, this is the number you should factor in (well actually the variance could be asymmetric so you ought to integrate over the whole distribution but let's ignore that for now too). OK, *but*, let's suppose the expectancy for white-men is 85 years and for non-white-men is 77 years. maybe for some reason it's reversed for women; white-women 82 years and non-white-women 86 years. NOW, if you know you're a white man, you should use 85 years as your best guess. OK, but what if you factor in height? smoking? educational history? city of birth? environmental toxin exposure? genetics? .... pretty soon, there won't even be a *single sample* of experience to draw on, to make an estimate of the life expectancy in your particular category.

one way around this problem is to assume that each factor operates linearly and independently. then, you solve a big linear regression problem, and linearly interpolate/extrapolate to points in the space where you don't have any samples. note that even under these simplifying assumptions, there are serious problems. for example, when you have many many dimensions, some of the estimated slopes will by chance come out extreme -- which will push your extrapolated prediction into crazy values. but even ignoring those problems, there's the bigger problem that the factors aren't actually linear or independent. in fact they interact very strongly: sometimes one factor might even reverse its direction of effect depending on another factor. so plain linear regression is not a good solution.
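
a little simulation of the "extreme slopes by chance" problem, again on made-up data where only a handful of the features actually matter: plain least squares picks up big spurious coefficients, and an extrapolated prediction lands far from the truth:

```python
# unregularized least squares with many dimensions and modest data: spurious large
# coefficients appear by chance, and extrapolating along them gives absurd predictions.
# all data here is simulated; only the first 5 of 150 features truly matter.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 200, 150

X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]        # the only real effects
y = 80 + X @ true_coef + rng.normal(0, 5, n_samples)

# ordinary least squares (lstsq handles the nearly-singular design matrix)
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_samples), X]), y, rcond=None)

print("largest true |coefficient|:     ", np.abs(true_coef).max())
print("largest estimated |coefficient|:", np.abs(beta[1:]).max())

# extrapolate to a point a few standard deviations out along every dimension
x_new = np.full(n_features, 3.0)
print("prediction at the extrapolated point:", beta[0] + x_new @ beta[1:])
print("what the true function says there:   ", 80 + x_new @ true_coef)
```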

what can you do? you have to regularize the problem. that is, you have to use some prior knowledge to drastically pare back the number of effective dimensions. for example, you might know some variables are strongly correlated, so you can treat them as one. or you might know that some variables are dominated by others, so you omit the weak variables. or you might know that the underlying function ought to be at least locally smooth, so you impose a smoothness constraint.
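
as one concrete (and very standard) version of this, here's ridge regression -- an L2 penalty that shrinks coefficients toward zero -- applied to the same simulated setup as above. it's just a sketch of one possible regularizer, not a claim that it's the right prior; the penalty encodes the assumption that most slopes are small:

```python
# ridge regression as one example of "prior knowledge paring back effective dimensions".
# reuses the simulated data from the previous sketch (only 5 of 150 features matter).
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 200, 150

X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = 80 + X @ true_coef + rng.normal(0, 5, n_samples)

def ridge_fit(X, y, lam):
    """closed-form ridge: beta = (X'X + lam*I)^-1 X'y, with an unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

x_new = np.full(n_features, 3.0)  # same extrapolation point as before (true value ~ 84.5)
for lam in [0.0, 10.0, 100.0]:
    intercept, beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:6.1f}  max|beta|={np.abs(beta).max():7.2f}  "
          f"prediction at x_new = {intercept + x_new @ beta:8.1f}")
```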

but what is the right regularization? brains are somehow really good at this (algorithms are getting better quickly though..)

i was just thinking that the regularizing prior knowledge could be a good place to fit in scale-invariance and holo-stuff. i haven't read enough, but i haven't yet seen much in machine learning on using a single learning framework to accommodate all different kinds/levels of data. i imagine that's what the brain is doing (maybe it's related to the relatively "unitary" consciousness that we seem to have subjectively) -- because we're essentially shifting around the focus of more or less the same global model to apply to massively disparate "kinds of things". so we have more samples to draw on for any given problem. we can even turn this global model partially on itself, which perpetually gives us even more samples and might also produce other weird features.

somehow, the circuitry of the cortex must be cleverly set up to do this kind of learning over space and time. (like hawkins' and others' "hierarchical temporal memory"-type ideas).

in a friston-type framework, i'm picturing that the structure of the organism at all levels already encodes a deep regularization (for example, RNA only interacts with certain molecules), since there's no separation between what is "inference" and what is just the dynamical structure of the "inferring entity" itself. if every dynamical system can be thought of as doing prediction (which means vastly generalizing from finite samples), then its whole structure, from the ground up, is scale-invariantly encoding regularization for all the external states it's exposed to.

