Excellent article - very clear on the specific advantages of each type. Thank you
What if there is no gradient in the training data to learn? Random search just as good?
We don't need gradients in the training data; the data can be discrete. We need gradients in the error function that compares the network's output with the target values. The network itself is a differentiable function that takes possibly discrete (or continuous) input and always produces continuous output (e.g. log probabilities of belonging to a class). So we inject gradients into the problem, in a sense, by the trick of turning the true learning objective into a differentiable approximation of it.
Oh, and yes, I said "training data" when I should have said "objective function," "loss function," or similar. I was in the field for 20 years, starting in 1997 and ending in 2018.
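A minimal sketch of the point above, with made-up data: the inputs are discrete (one-hot categories), and the "true" objective (0/1 accuracy) has no useful gradient, but the network's cross-entropy loss is differentiable in the weights, so plain gradient descent works.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 examples, each a discrete category 0..4, encoded one-hot.
n, k = 200, 5
cats = rng.integers(0, k, size=n)
X = np.eye(k)[cats]                      # discrete input, one-hot encoded
y = (cats % 2).astype(float)             # binary target, separable by construction

w = np.zeros(k)                          # a tiny logistic-regression "network"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w)                   # continuous output in (0, 1)
    # Gradient of the cross-entropy loss: the differentiable surrogate
    # for the non-differentiable 0/1 accuracy objective.
    grad = X.T @ (p - y) / n
    w -= 1.0 * grad

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
print(acc)   # 1.0: the surrogate loss drives accuracy to perfect
```

The gradient flows through the weights and the continuous output, never through the discrete inputs themselves, which is exactly why discrete data is no obstacle.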
Excellent explanation and interesting approach. We had wide data in the form of expression data (real-valued predictors, binary class labels), often with tens of subjects and 360,000 predictors. We looked at feature/predictor selection and evolutionary computing techniques. The field has blown up with LLMs in the AI space, and I have lost touch with current approaches. I am retired due to MS, so now I just piddle with photography.
Damn, that's really wide! I imagine getting any classifier to work would be very hard; the system is extremely underdetermined.
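A quick numerical illustration of just how underdetermined that regime is (with small stand-in numbers, not the actual 360,000-predictor data): when predictors vastly outnumber subjects, a plain linear model can fit even random labels perfectly, so training accuracy by itself is meaningless.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5000                        # 20 "subjects", 5000 "predictors"
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)    # labels are pure noise

# Minimum-norm least-squares fit; lstsq handles the p >> n case,
# where infinitely many weight vectors interpolate the data exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)
print(train_acc)   # 1.0: perfect training fit of random labels
```

This is why feature selection (or heavy regularization) is unavoidable in the wide-data setting: without it, perfect training performance carries no information about generalization.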
Absolutely! We didn't really have any choice; getting the human genome was a costly and complex adventure. We could have gone all the way up to the full genome, with 3 billion attributes/predictors! We never went over a million, and even that took a long time on the supercomputers I had at my disposal. I'd much rather work on algorithms in C++ :-)
"The Universal Approximation Theorem ... to any desired degree of accuracy"
From an information standpoint, the total number of significant figures in the output can't exceed the total number of significant figures in the trained net. But it presumably could exceed the number of significant figures in a single trained node. How many significant figures do trained nodes usually hold? And is there a relationship between the average number in the nodes and the number in the output?
I'm not sure if this answers your question, but the universal approximation theorem is a purely analytical result: we assume real numbers and show that there is a way to set a large enough number of parameters to the right values to approximate any sensible function, not too far in spirit from Fourier series. In practice, of course, we don't have arbitrary precision, and we don't have a compact domain; we have a finite number of input examples. That's why in part two I'll go over the practicalities of learning from finite data.
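A rough numerical companion to this point, under stated assumptions: the hidden layer below is fixed random ReLU features with only the output weights fit by least squares, which is a sketch of width-vs-accuracy behavior on a compact domain, not the construction used in the theorem's proof.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2 * np.pi, 400)
target = np.sin(x)

def fit_error(width):
    # One hidden layer of random ReLU units: relu(a*x + b).
    a = rng.standard_normal(width)
    b = rng.uniform(-2 * np.pi, 2 * np.pi, width)
    H = np.maximum(0.0, np.outer(x, a) + b)
    H = np.hstack([H, np.ones((x.size, 1))])   # bias column
    # Fit only the output weights by least squares.
    c, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.max(np.abs(H @ c - target))      # worst-case (sup-norm) error

errs = [fit_error(w) for w in (5, 50, 500)]
print(errs)   # error shrinks as the hidden layer widens
```

With real arithmetic the error can be driven as low as you like by adding width, which is the theorem's content; with finite precision and finite data, the shrinkage bottoms out, which is the practical caveat above.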
Thanks for that. Excellent. It's at exactly the level that I needed.