10 Comments

Excellent article - very clear on the specific advantages of each type. Thank you

What if there is no gradient in the training data to learn from? Would random search be just as good?

We don't need gradients in the training data; that can be discrete. We need gradients in the error function that compares the output of the network with the target values, so that the network itself is a differentiable function that takes possibly discrete (or continuous) input and always produces continuous output (e.g. log probabilities of belonging to a class). So we inject gradients into the problem, in a sense, by the trick of turning the learning objective into a differentiable approximation of the true learning objective.
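
As a minimal sketch of that trick (the toy data, single linear layer, and learning rate below are illustrative assumptions, not from the article): the true objective, 0/1 classification error, has no useful gradient, so we descend on a differentiable surrogate, the cross-entropy of the network's continuous class probabilities, while the inputs themselves stay discrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete (binary) inputs and labels -- purely illustrative data.
X = rng.integers(0, 2, size=(100, 8)).astype(float)
y = (X.sum(axis=1) > 4).astype(int)

W = rng.normal(scale=0.1, size=(8, 2))  # one linear layer, two classes

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(201):
    p = softmax(X @ W)                                   # continuous class probabilities
    zero_one = (p.argmax(axis=1) != y).mean()            # true objective: not differentiable
    surrogate = -np.log(p[np.arange(len(y)), y]).mean()  # cross-entropy: differentiable
    grad = X.T @ (p - np.eye(2)[y]) / len(y)             # gradient of the surrogate w.r.t. W
    W -= 0.5 * grad
    if step % 50 == 0:
        print(f"step {step}: 0/1 error {zero_one:.2f}, surrogate {surrogate:.3f}")
```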

Oh, and yes, I said "training data" when I should have said "objective function," "loss function," or similar. I was in the field for 20 years, starting in 1997 and ending in 2018.

Excellent explanation and interesting approach. We had wide data in the form of expression data (real output, binary class label), often with tens of subjects and 360,000 predictors. We looked at feature/predictor selection and evolutionary computing techniques. The field has blown up with LLMs in the AI space, and I have lost touch with current approaches. I am retired due to MS, so now I just piddle with photography.

Damn, that's really wide! I imagine getting any classifier to work would be very hard; the system is extremely underdetermined.
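
A quick illustration of how underdetermined that regime is (the sizes and random data below are made up for the sketch): with far more predictors than samples, even pure-noise features fit the training labels exactly, so in-sample fit by itself says nothing about real signal.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 20, 5000               # "wide" data: p >> n
X = rng.normal(size=(n_samples, n_features))   # pure noise predictors
y = rng.normal(size=n_samples)                 # labels unrelated to X

# Minimum-norm least-squares solution -- one of infinitely many exact fits.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("max training residual:", np.abs(X @ w - y).max())  # ~0: perfect fit on noise
```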

Absolutely! We didn't really have any choice. Getting the genome for humans was a costly and complex adventure. We could have gone all the way up to the full genome with 3 billion attributes/predictors! We never went over a million, and even that took a long time on the supercomputers I had at my disposal. I'd much rather work on algorithms in C++ :-)

"The Universal Approximation Theorem ... to any desired degree of accuracy"

From an information standpoint, the total number of significant figures in the output can't exceed the total number of significant figures in the trained net. But it presumably could exceed the number of significant figures in a single trained node. How many significant figures do trained nodes usually hold? And is there a relationship between the average number in the nodes and the number in the output?

I'm not sure if this answers your question, but the universal approximation theorem is a purely analytical result: we assume real numbers and find that there is a way to set a large enough number of parameters to the right values to approximate any sensible function, not too far from Fourier analysis. In practice, of course, we don't have arbitrary precision, and we don't have a compact domain; we have a finite number of input examples. That's why in part two I'll go over the practicality of learning from finite data.
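
For reference, the classical one-hidden-layer form of the statement being paraphrased here (standard notation, not taken from the article): for any continuous target f on a compact domain, a suitable activation σ, and any tolerance ε > 0, there is a finite width N and a setting of the weights that approximates f uniformly.

```latex
% Universal approximation, one hidden layer (Cybenko 1989 / Hornik 1991 style):
% f continuous on a compact K \subset \mathbb{R}^d, \sigma a suitable
% (e.g. sigmoidal) activation, \varepsilon > 0; then there exist N and
% parameters a_i \in \mathbb{R}, w_i \in \mathbb{R}^d, b_i \in \mathbb{R} with
\[
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} a_i \,\sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon .
\]
```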

Thanks for that. Excellent. It's at exactly the level that I needed.
