Floating point calculations are slow for computers (specifically CPUs); possibly representing the same struggle for many humans. 🙂
I remember a time when a FPU (floating point unit) was an upgrade and one had to pay extra to get one. Very useful when you needed that extra precision in computing – and in my head, it always seemed like the Turbo button. 🙂
For most #ML workloads and computations, precision isn’t the most important criteria; with every increasing data and parameters (looking at you GPT-3 with 45 TB of data and 175 billion parameters!), what most ML needs today is speed and dynamic range.
This is where bfloat16 (Brain floating-point format with 16 bits) – a new floating-point format comes handy and in the context of #AI improves on IEEE 754 – the current floating-point arithmetic standard.
As per IEEE 754, a floating point it will always take up 32 bits (see Figure 1 below) – irrespective of the size of the number. The exponent (8 bits) tells us how many numbers we shift (left or right) and place the decimal. The fraction (23 bits), also called the mantissa, holds the actual number – i.e. the data.
bfloat16 truncates the data size in a third (see Figure 2) – with the fraction truncated from 23 to 7 bits. This of course means bfloat16 isn’t as precise. However bfloat16 has the same exponent bits as IEEE-754 it can represent a similar range (small to large), but more importantly are easier to convert between bfloat16 and IEEE 754.
Less precision doesn’t impact the matrix multiplication as much so in the context of ML training and inference these chips at scale are more efficient – not only they are faster, they also use less power, and memory bandwidth.
What is interesting in some neural nets such as a DNN, these less precision bfloat16 are more precise compared to IEEE 754! This is because the regularization and quantization weights cannot use the finer precision represented by IEEE 754 but adapt better with bfloat16. 🙂
Finally, bfloat16 is not a universal standard (yet); most AI chips support this. ARM, Intel, and, AMD have started adding support for this in their chipsets.
The naming is unfortunate when talking about #AI. There isn’t anything about intelligence – not as we humans know of it. If we can rewind back to the 50’s we can perhaps rename it to something like Computational Intelligence, which is more accurate. And although I have outlined the difference between some of the elements of AI in the past, I wanted to get back to what the intent was and how this area started.
Can machines think? Some say, the origins of #AI go back to Turing and started with his paper “Computing machinery and intelligence” (PDF) when it was published in 1950.Whilst, Turing might have planed the seed, it was a program called Logic Theorist created Allen Newell, Cliff Shaw, and Herbert Simon which was the first #ArtificialIntelligence program. Of course it wasn’t called #AI then.
Since then, AI has had a roller coaster of a ride over the decades – from colder than hell (I presume) winters, to hotter than lava with it being everywhere. As someone said, time will heal all wounds.
Today, many of us use #AI, #DeepLearning, and, #MachineLearning interchangeably. Over the course of last couple of years, I have learned to ignore that, but fundamentally the distinction is important.
AI, we would say is more computational intelligence – allowing computers to do tasks that would be difficult for humans to do, certainly at scale. And these tasks are accomplished using different mechanisms and techniques, using “intelligent agents”.
Machine learning is a subset of AI, where the program or algorithm can learn from previous outputs, and improve based on that data – hence the “learning” part. It is akin to it learning from experience, but isn’t the same thing as we humans can comprehend and understand. Some of us think, the program is rewriting itself, which technically isn’t an accurate description.
Deep Learning is a set of techniques and algorithms of machine learning that are inspired from how the neurals in our brain connect together and work. These set of techniques are also called Neural Networks, and essentially are nothing but type of machine learning
For any of this AI “magic” to work, the one thing it needs to feed on is data. Without data, none of this would be possible. This data is classified into two categories – features and labels.
Features – these are aspects of whatever we are interested in. For example if we are interested in vehicles features could be the colour, make, and, model of the vehicle.
Labels – these are buckets of categories we put the things we are interested in. Using the same vehicles examples, we can have labels such as SUV, Sedan, Sports Car, Trucks, etc. that categorize vehicles.
One key principle to remember when it comes to #AI – all the outcomes that are described are in the terms of probabilities and not absolutes. All it suggests is the likelihood of something to happen, and most things cannot be predicted with total certainty. And this fundamental aspect one should remember when making decisions.
There isn’t a universal definition of AI, which sometimes doesn’t help. Each has their own perception. I have gotten over it to come to their terms and ensure we are talking the same lingo and meaning. It doesn’t help to get academic about it. 🙂
For example taking three leading analysts (Gartner, IDC, and Forrester) definition of AI (outlined below) is a good indicator on how this can get confusing.
Gartner – At its core, AI is about solving business problems in novel ways. It stretches across any organization from innovation, R&D and IT to data science.
IDC defines cognitive/Artificial Intelligence (AI) systems as a set of technologies that use deep natural language processing and understanding to answer questions and provide recommendations and direction. IDC’s coverage of cognitive/AI systems examines:
Artificial intelligence, deep learning and machine learning
Automated recommendation systems
Forrester defines AI as a liberatory technology at its core, and businesses that integrate it will free workers to become more innovative, creative, and adaptive than ever before. But these technologies are still in early stages.
And the field is just exploding now – not just with new research around #DeepLearning or #MachineLearning, but also net new aspects from a business perspectives; things like:
Democratization of AI
Data Engineering (OK, not new, but certainly key)
RPA (or #IntelligentAutomation)
It is a new and exciting world that spans multiple spectrum. Don’t try and drink from the fire-hose, but take it in slowly, appreciate the nuances and what one brings value and discuss in terms of outcomes.
Regularization is a fundamental concept in Machine Learning (#ML) and is generally used with activation functions. It is the key technique that help with overfitting.
Overfitting is when an algorithm or model ‘fits’ the training data too well – it seems to good to be true. Essentially overfitting is when a model being trained, learns the noise in the data instead of ignoring it. If we allow overfitting, then the network only uses (or is more heavily influenced) by a subset of the input (the larger peaks), and doesn’t factor in all the input.
The worry there being that outside of the training data, it might not work as well for ‘real world’ data. For example the model represented by the green line in the image below (credit: Wikipedia), follows the sample data too closely and seems too good. On the other hand, the model represented by the black line, which is better.
Regularization helps with overfitting (artificially) penalizing the weights in the neural network. These weights are represented as peaks, and this reduces the peaks in the data. This ensure that the higher weights (peaks) don’t overshadow the rest of the data, and hence getting it to overfit. This diffusion of the weight vectors is sometimes also called weight decay.
Although there are a few regularization techniques for preventing overfitting (outlined below), these days in Deep Learning, L1 and L2 regression techniques are more favored over the others.
Cross validation: This is a method for finding the best hyper parameters for a model. E.g. in a gradient descent, this would be to figure out the stopping criteria. There are various ways to do this such as the holdout method, k-fold cross validation, leave-out cross validation, etc.
Step-wise regression: This method essentially is a serial step-by-step regression where one reduces the weakest variable. Step-wise regression essentially does multiple regression a number of times, each time removing the weakest correlated variable. At the end you are left with the variables that explain the distribution best. The only requirements are that the data is normally distributed, and that there is no correlation between the independent variables.
L1 regularization: In this method, we modify the cost function by adding the sum of the absolute values of the weights as the penalty (in the cost function). In L1 regularization the weights shrinks by a constant amount towards zero. L1 regularization is also called Lasso regression.
L2 regularization: In L2 regularization on the other hand, we re-scale the weight to a subset factor – it shrinks by an amount that is proportional to the weight (as outlined in the image below). This shrinking makes the weight smaller and is also sometimes called weight decay. To get this shrinking proportional, we take a squared mean of the weights, instead of the sum. At face value it might seem that the weight eventually get to zero, but that is not true; typically other terms cause the weights to increase. L2 regularization is also called Ridge regression.
Max-norm: This enforces a upper bound on the magnitude of the weight vector. The one area this helps is that a network cannot ‘explode’ when the learning rates gets very high, as it is bounded. This is also called projected gradient descent.
Dropout: Is very simple, and efficient and is used in conjunction with one of the previous techniques. Essentially it adds a probably on the neuron to keep it active, or ‘dropout’ by setting it to zero. Dropout doesn’t modify the cost function; it modifies the network itself as shown in the image below.
Increase training data: Whilst one can artificially expand the training set theoretically possible, in reality won’t work in most cases, especially in more complex networks. And in some cases one might think also to artificially expand the dataset, typically it is not cost effective to get a representative dataset.
Between L1 and L2 regularization, many say that L2 is preferred, but I think it depends on the problem statement. Say in a network, if a weight has a large magnitude, L2 regularization shrink the weight more than L1 and will better. Conversely, if the weight is small then L1 shrinks the weight more than L2 – and is better as it tends to concentrate the weight in fewer but more important connections in the network.
In closing, the key aspect to appreciate – the small weights (peaks) in a regularized network essentially means that as our input changes randomly (i.e. noise), it doesn’t have a huge impact to the network and its output. So this makes it difficult for the network to learn the noise and respond to that. Conversely, in an unregularized networks, that has higher weights (peaks), small random changes to those weights can have a larger impact to the behavior of the network and the information it carries.
Neural Networks, today, help in a great set of tasks, that until very recently wasn’t possible at all – be it from computer vision, to medical diagnosis, to speech translation and forms a key cornerstone to a lot of ‘magic’ that Machine Learning and AI offers today.
It is this aspect of neural networks that allow us to map any process and generate a corresponding function. Unlike a function in Computer Science, this function isn’t deterministic; instead is confidence score of an approximation (i.e. a probability). The more layers in a neural network, the better this approximation will be.
In a neural network, typically there is one input layer, one output layer, and one or more layers in the middle. To the external system, only the input layer (values of ), and the final output (output of the function ) are visible, and the layers in the middle are not and essentially hidden.
Each layer contains nodes, which is modeled after how the neurons in the brain works. The output of each node gets propagated along to the next layer. This output is the defining character of the node, and activates the node to pass on its value to the next node; this is very similar to how a neuron in the brain fires and works passing on the signal to the next neuron.
To make this generalization of function outlined above to hold, the that function needs to be continuous function. A continuous function is one where small changes to the input value , creates small changes to the output of . If these outputs, are not small and the value jumps a lot then it is not continuous and it is difficult for the function to achieve the approximation required for them to be used in a neural network.
For a neural network to ‘learn’ – the network essentially has to use different weights and biases that has a corresponding change to the output, and possibly closer to the result we desire. Ideally small changes to these weights and biases correspond to small changes in the output of the function. But one isn’t sure, until we train and test the result, to see that small changes don’t have bigger shifts that drastically move away from the desired result. It isn’t uncommon to see that one aspect of the result has improved, but others have not and overall skewing the results.
In simple terms, an activation function is a node that attached to the output of a neural network, and maps the resulting value between 0 and 1. It is also used to connect two neural networks together.
An activation function can be linear, or non-linear. A linear isn’t terribly effective as its range is infinity. A non-linear with a finite range is more useful as it can be mapped as a curve; and then changes on this curve can be used to calculate the difference on the curve between two points.
There are many times of activation function, each either their strengths. In this post, we discuss the following six:
1. Sigmoid function
A sigmoid function can map any of input values into a probability – i.e., a value between 0 and 1. A sigmoid function is typically shown using a sigma (). Some also call the () a logistic function. For any given input value, the official definition of the sigmoid function is as follows:
If our inputs are , and their corresponding weights are , and a bias b, then the previous sigmoid definition is updated as follows:
When plotted, the sigmoid function, will look plotted looks like this curve below. When we use this, in a neural network, we essentially end up with a smoothed out function, unlike a binary function (also called a step function) – that is either 0, or 1.
For a given function, , as , tends towards 1. And, as as , tends towards 0.
And this smoothness of is what will create the small changes in the output that we desire – where small changes to the weights (), and small changes to the bias () will produce a small changes to the output ().
Fundamentally, changing these weights and biases, is what can give us either a step function, or small changes. We can show this as follows:
One thing to be aware of is that the sigmoid function suffers from the vanishing gradient problem – the convergence between the various layers is very slow after a certain point – the neurons in previous layers don’t learn fast enough and are much slower than the neurons in later layers. Because of this, generally a sigmoid is avoided.
2. Tanh (hyperbolic tangent function)
Tanh, is a variant of the sigmoid function, but still quite similar – it is a rescaled version and ranges from –1 to 1, instead of 0 and 1. As a result, its optimization is easier and is preferred over the sigmoid function. The formula for tanh, is
Using, this we can show that:
Tanh also suffers from the vanishing gradient problem. Both Tanh, and, Sigmoid are used in FNN (Feedforward neural network) – i.e. the information always moves forward and there isn’t any backprop.
3. Rectified Linear Unit (ReLU)
A rectified linear unity (ReLU) is the most popular activation function that is used these days.
ReLU’s are quite popular for a couple of reasons – one, from a computational perspective, these are more efficient and simpler to execute – there isn’t any exponential operations to perform. And two, these doesn’t suffer from the vanishing gradient problem.
The one limitation ReLU’s have, is that their output isn’t in the probability space (i.e. can be >1), and can’t be used in the output layer.
As a result, when we use ReLU’s, we have to use a softmax function in the output layer. The output of a softmax function sums up to 1; and we can map the output as a probability distribution.
Another issue that can affect ReLU’s is something called a dead neuron problem (also called a dying ReLU). This can happen, when in the training dataset, some features have a negative value. When the ReLU is applied, those negative values become zero (as per definition). If this happens at a large enough scale, the gradient will always be zero – and that node is never adjusted again (its bias. and, weights never get changed) – essentially making it dead! The solution? Use a variation of the ReLU called a Leaky ReLU.
4. Leaky ReLU
A Leaky ReLU will usually allow a small slope on the negative side; i.e that the value isn’t changed to zero, but rather something like 0.01. You can probably see the ‘leak’ in the image below. This ‘leak’ helps increase the range and we never get into the dying ReLU issue.
5. Exponential Linear Unit (ELU)
Sometimes a ReLU isn’t fast enough – over time, a ReLU’s mean output isn’t zero and this positive mean can add a bias for the next layer in the neural network; all this bias adds up and can slow the learning.
Exponential Linear Unit (ELU) can address this, by using an exponential function, which ensure that the mean activation is closer to zero. What this means, is that for a positive value, an ELU acts more like a ReLU and for negative value it is bounded to -1 for – which puts the mean activation closer to zero.
When learning, this derivation of the slope is what is fed back (backprop) – so for this to be efficient, both the function and its derivative need to have a lower computation cost.
And finally, there is another various of that combines with ReLU and a Leaky ReLU called a Maxout function.
So, how do I pick one?
Choosing the ‘right’ activation function would of course depend on the data and problem at hand. My suggestion is to default to a ReLU as a starting step and remember ReLU’s are applied to hidden layers only. Use a simple dataset and see how that performs. If you see dead neurons, than use a leaky ReLU or Maxout instead. It won’t make sense to use Sigmoid or Tanh these days for deep learning models, but are useful for classifiers.
In summary, activation functions are a key aspect that fundamentally influence a neural network’s behavior and output. Having an appreciation and understanding on some of the functions, is key to any successful ML implementation.
I was looking at something else and happen to stumble across something called Netron, which is a model visualizer for #ML and #DeepLearning models. It is certainly much nicer than for anything else I have seen. The main thing that stood out for me, was that it supports ONNX , and a whole bunch of other formats (Keras, CoreML), TensorFlow (including Lite and JS), Caffe, Caffe2, and MXNet. How awesome is that?
Below is a couple of examples, of visualizing a ResNet-50 model – you can see both the start and the end of the visualization shown in the two images below to get a feel of things.
And of course, this can get very complex (below is the same model, just zoomed out more).
I do think it is a brilliant, tool to help understand the flow of things, and what can one do to optimize, or fix. Also very helpful for folks who are just starting to learn and appreciate the nuances.
Someone recently asked me, what are some of the use cases / examples of machine learning. Whilst, this might seem as an obvious aspect to some of us, it isn’t the case for many businesses and enterprises – despite that they uses elements of #ML (and #AI) in their daily life – as a consumer.
Whilst, the discussion gets more interesting based on the specific domain and the possibly use cases (of course understanding that some might not be sure f the use case – hence the question in the first place). But, this did get me thinking and wanted to share one of the images we use internally as part of our training that outcomes some of the use cases.
These are not 1:1 and many of them can be combined together to address various use cases – for example a #IoT device sending in a sensor data, that triggers a boundary condition (via a #RulesEngine), that in addition to executing one or more business rule, can trigger a alert to a human-in-the-loop (#AugmentingWorkforce) via a #DigitalAssistant (say #Cortana) to make her/him aware, or confirm some corrective action and the likes. The possibilities are endless – but each of these elements triggered by AI/ML and still narrow cases and need to be thought of in the holistic picture.
Over the last few weeks, I built a self-driving car – which essentially is a remote control Rx car that uses a raspberry pi running Python, TensorFlow implementing a end-to-end convolution neural network (CNN)
Of course other than being a bit geeky, I do think this is very cool to help understand and get into some of the basic constructs and mechanics around a number of things – web page design, hardware (maker things), and Artificial Intelligence principles.
There are two different models here – they do use the same ASC and controller that can be programmed. My 3D printer, did mess up a little (my supports were a little off) and which is why you see the top not clean.
The sensor and camera are quite basic, and there is provisions to add and do better over time. The Pi isn’t powerful enough to train the model – you need another machine for that (preferably a I7 core with a GPU). Once trained you can run the model on the Pi for inference.
This is the second car, which is a little different hardware, but the ESC to control the motor and actuators are the same.
The code is simple enough; below is an example of the camera (attached) to the Pi, saving the images it is seeing. Tubs is the location where the images are saved; these can then be transferred to another machine for training or inference.
import donkey as dk
#initialize the vehicle
V = dk.Vehicle()
#add a camera part
cam = dk.parts.PiCamera()
V.add(cam, outputs=['image'], threaded=True)
#add tub part to record images
tub = dk.parts.Tub(path='~/d2/gettings_started',
#start the vehicle's drive loop
Below you can see the car driving itself around the track, where it had to be trained first. The reason it is not driving perfectly is because during training (when I was manually driving it around), I crashed a few times and as a result the training data was messed up. Needed more time to clean that up and retrain it.
This is based on donkey car – which is an open source DIY for platform for small-scale self driving cars. I think it is also perfect to get into with those who have teenagers and a little older kids to get in and experiment. You can read up more details on how to go about building this, and the parts needed here.
If you were trying to pull the latest source code on your Raspberry Pi for donkeycar, and get the following error, then probably your clock is off (and I guess some nonce is failing). This can happen if your pi had been powered off for a while (as in my case), and it’s clock is off (clock drift is a real thing) :).
fatal: unable to access 'https://github.com/wroscoe/donkey/': server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
To fix this, the following commands works. It seems the Raspberry Pi 3, by default has NTP disabled and this would enable it. I also had to check the result status with the second command, and force it with the third one.
sudo timedatectl set-ntp True
sudo timedatectl set-local-rtc true
And that should do it; you might need to reboot the pi just to get it back on and then you should be able to pull the code off git and deploy your autonomous car.
Can #AI make me look (more) presentable? The jury is out I think.
This is called style transfer, where the style/technique from a kind of painting (could be a photos too) is applied to an image, to create a new image. I took this using the built-in camera on my machine sitting at my desk and then applying the different kind of ‘styles’ on it. Each of these styles are is a separate #deeplearning model that has learned how to apply the relevant style to a source image.
Style – Candy
Style – Feathers
Style – Mosaic
Style – Robert
Specifically, this uses a Neural Network (#DeepLearning) model called VGG19, which is a 19 layer model running on TensorFlow. Of course you can export this to a ONNX model, that then can be used in most other run-times and libraries.
Thinking about #machinelearning? It will be helpful to understand some numerical computations and concepts that affect the #ML algorithm.
One might not interact with these directly, but we surely can feel the effect. The things you need to think about are:
1. Overflow and underflow – thinking of them as rounding up or down errors that shift the functions enough, and compounded across the iterations cam be devastating. Of course can also easily get to division by zero.
2. Poor conditioning – essentially with small changes of input data, how large can the output move. You want this small. (And in cryptography you want the opposite, and large).
3. Gradient optimizations – there will be some optimization happening in the algorithm, question is how does it handle various local points on the curve? Local minimum, saddle points, and local maximum. Generally speaking, it’s about optimizing continuous spaces.
Some algorithms take this a step further by measuring a second derivative (think of it as measuring the derivative of a derivative – the curvature of a function).
4. Constrained Optimization – sometimes we just want to operate on a subset – so constraints only on that set.
All of these come into play some way, directly or indirectly and having a basic understanding and constraints around this would help a long way.
I know I have had to explain this a lot in most #AI related conversations that I have had – and lately those have been quite a lot. In my experience, most people use these terms interchangeably when they are meaning one over the other.
Whilst they all are (inter)related and one might help trigger the other, they are still fundamentally different and at some point, it is good to understand the differences. I like the image below (source) that whilst on one hand is showing a time graph, the correlation between them and how one is a subset of the other is what is interesting.
#AI is getting more powerful and the potential of it which personally really excites me is the paradigm shift we are starting to see. Fundamentally it is changing on how we use, interact, and, value computers and technology.
It is shifting from us learning machines and their idiosyncrasies (remember when being computer literate was a differentiator on a resume) to this shift where technology learns us and interacts with us in a more natural, and dare I say human manner.
I almost see it as StarTrek (and now showing my age) – the computer is everywhere, yet it is no where. It is embedded and woven into everything we do on the Enterprise rather an some “thing” one interacts with.
And it is awesome to start seeing some of this coming to life, even if it is in a demo as outlined at Build a couple of weeks ago. #AI in the Workplace and how it interacts with objects in real-time and can invoke and interact Business workflow (such as workplace policies).
The degree of calculations is pretty phenomenal – 27 million / sec [separately I would love to understand the definition on calculation 🙂 ]. But then given where we are heading with a fully autonomous car generating about 100GB of data each second, this isn’t small potatoes.
And whilst you can read up more on these terms and how they link, I really like to move away from the different terms which most people confuse in the first place and start thinking of more business outcomes and how enterprises and people will use.
To that end, the three buckets of Intelligent Automation, Robotic Process Automation (RPA), and Physical Automation is what we have found work better. On RPA, the one caveat being that it is not about robots, but rather the automation of a (business) process. The robots aspect would fall under physical automation – which essentially is anything that interacts with the real/physical world.