by: Kristina Rogale Plazonic
What is machine learning?
– learning to predict from existing examples, without explicitly programming the rules used for prediction
How can machine learning help your business?
– it removes manual work, saving time and money, and in many cases improves accuracy (add examples)
What conditions need to be satisfied in order to apply ML?
– you need to have collected useful data about your problem – this is very non-trivial and requires foresight (add examples – e.g. if you want to predict whether a computer in a data center is about to fail, it might be useful to collect its core temperature, which is not usually part of the logs)
– your data needs to have patterns – if the attributes/features don’t capture the objects they represent, or if positive and negative examples fall randomly among each other, ML cannot be applied successfully (add picture of distributions of positive and negative examples where ML can and cannot be applied)
How are ML algorithms different from (better than) rules engines?
– no need to explicitly add rules – rules are produced by minimizing some criterion (like minimizing error between predicted value and actual value)
– ML models generalize better on previously unseen data
– ML models in general give a confidence score, which allows you to prioritize examples for inspection
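The confidence score is what makes triage easy to automate. A minimal sketch, with made-up example IDs, labels and scores, of how you might prioritize examples for human inspection:

```python
# Hypothetical predictions: (example_id, predicted_label, confidence in [0, 1]).
predictions = [
    ("txn-1", "fraud", 0.97),
    ("txn-2", "ok",    0.55),
    ("txn-3", "fraud", 0.62),
    ("txn-4", "ok",    0.99),
]

# Inspect the least confident predictions first: that is where a
# human reviewer adds the most value.
by_uncertainty = sorted(predictions, key=lambda p: p[2])

for example_id, label, confidence in by_uncertainty[:2]:
    print(f"review {example_id}: predicted {label} at {confidence:.0%} confidence")
```

A rules engine gives you only a hard yes/no, so there is no natural way to rank the queue like this.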
What is a random forest?
– a collection of many decision trees. Each tree is trained on a different bootstrap subsample of the data, and at each split only a random subset of the attributes is considered. This helps the forest generalize to unseen examples, because no single feature can dominate every tree.
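The two sources of randomness can be sketched in a few lines of Python (the toy data and feature names are invented for illustration, and the tree-training step itself is elided):

```python
import random

random.seed(0)

# Toy dataset: each row is a list of feature values.
features = ["temp", "load", "disk_io", "mem"]
data = [[random.random() for _ in features] for _ in range(100)]

n_trees = 5
for _ in range(n_trees):
    # 1) Bootstrap sample: draw rows with replacement, same size as the data,
    #    so each tree sees a slightly different dataset.
    bootstrap = [random.choice(data) for _ in data]
    # 2) Random subset of attributes considered at a split
    #    (roughly sqrt(n_features) is a common default).
    split_features = random.sample(features, k=2)
    # ... a decision tree would be trained here on (bootstrap, split_features)
```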
Why is a random forest better than a single decision tree?
– it generalizes better to unseen examples
– no single tree needs to be very good at predicting
– as long as each tree in the forest predicts better than chance (above 50% on a binary problem), combining many such weak predictors lets you achieve excellent accuracy – think of the audience lifeline in “Who Wants to Be a Millionaire” – in general, ensembles are better!
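That intuition can be checked with a small calculation: the probability that a majority vote of n independent classifiers, each correct with probability p, gives the right answer. (Real trees are correlated – which is exactly why the per-tree subsampling matters – so this sketch assumes full independence as a best case.)

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right answer."""
    # Sum P(exactly k correct) over all k that form a majority.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With p = 0.6, accuracy climbs steadily as the number of voters grows.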
What are some good features of random forests?
– they work well with little tuning,
– they handle both continuous and categorical attributes,
– they perform well on a variety of different datasets,
– the models are interpretable,
– they capture nonlinearities and interactions between attributes,
– they are easily parallelizable,
– they are very fast at producing predictions,
– in a 2014 bakeoff of 179 classifiers from 17 families on 121 datasets, 3 of the top 5 were random forests.
How can you improve ML models in general?
– you can combine different models – this is “ensembling” – different models might have a different perspective on the data – this technique is ALWAYS used in winning data science competitions.
– you can cluster your data first and train a different model on each subgroup – a single model might ignore a small group of data where it is underperforming, if that group’s contribution to the overall error term is insignificant; training a separate model on just that group forces it to perform well there.
– you can inspect your data (especially your misclassified data) and see if you can find out why misclassification happens – you might discover that adding a new feature might be good, or that you need to clean data better.
– you can remove outliers before training the model, because outliers can skew your model sometimes.
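The first technique above, ensembling, can be as simple as averaging scores. A minimal sketch with hypothetical predicted probabilities from two models on the same examples:

```python
# Hypothetical predicted probabilities from two different models
# on the same five examples.
model_a = [0.9, 0.2, 0.6, 0.3, 0.8]
model_b = [0.7, 0.1, 0.8, 0.7, 0.9]

# Simplest ensemble: average the scores, then threshold at 0.5.
# Each model's mistakes get diluted by the other's perspective.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]
labels = [1 if p >= 0.5 else 0 for p in ensemble]
```

Winning competition entries usually go further (weighted averages, stacking a meta-model on top), but plain averaging already captures the idea.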
How is active learning better than retraining your model every day with new examples?
– if today’s new examples contain a big group of similar examples that differs from yesterday’s, active learning lets you inspect and label only a small portion of that group; the ML model is retrained automatically while the human labels, and the remaining similar examples are then labeled automatically.
– with plain daily retraining, you would either have to label the whole group by hand today, or wait until tomorrow’s model to see any results.
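The core step – route uncertain examples to the human and let the model handle the rest – can be sketched in a few lines (the example IDs, scores, and confidence threshold are invented for illustration):

```python
# Hypothetical pool of new examples with the current model's score in [0, 1]:
# a score near 0 or 1 means confident, near 0.5 means uncertain.
pool = {"a": 0.98, "b": 0.51, "c": 0.03, "d": 0.47, "e": 0.92, "f": 0.55}

CONFIDENT = 0.15  # assumed distance from 0.5 beyond which we trust the model

# Ask the human only about the uncertain examples ...
to_label = [eid for eid, s in pool.items() if abs(s - 0.5) < CONFIDENT]

# ... and auto-label the rest with the model's own prediction.
auto_labeled = {eid: int(s >= 0.5)
                for eid, s in pool.items() if abs(s - 0.5) >= CONFIDENT}
```

In a real loop, the model would be retrained on the fresh human labels and the pool re-scored, shrinking `to_label` with each pass.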