The family of linear models includes ordinary linear regression, Ridge regression, Lasso regression, SGD regression, and so on. The coefficients of linear models are commonly interpreted as the feature importance of the corresponding variables. In general, feature importance refers to how useful a feature is at predicting a target variable. For example, how useful age_of_a_house is at predicting house price.
This article summarizes and explains four often neglected but crucial pitfalls in using coefficients of linear models as feature importance:
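To make the idea concrete, here is a minimal sketch of fitting an ordinary least-squares model and reading off its coefficients. The data is synthetic, the second feature `living_area` and the "true" coefficients are assumptions made up for illustration, and NumPy's `lstsq` stands in for whichever linear model you actually use.

```python
import numpy as np

# Hypothetical toy data: age_of_a_house (years) and living_area (square meters).
rng = np.random.default_rng(0)
age = rng.uniform(0, 50, size=200)
area = rng.uniform(50, 250, size=200)

# Assumed relationship for illustration: price falls with age, rises with area.
price = 300.0 - 2.0 * age + 1.5 * area + rng.normal(0, 5, size=200)

# Ordinary least squares; the column of ones fits the intercept.
X = np.column_stack([np.ones_like(age), age, area])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

for name, value in zip(["intercept", "age_of_a_house", "living_area"], coef):
    print(f"{name}: {value:.3f}")
```

The fitted values for `age_of_a_house` and `living_area` are the numbers that are typically read as feature importance.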
Linear regression models…
“Where Tim Hortons is, that’s my home.” Tim Hortons, a Starbucks-like coffee shop founded by a hockey player, has always held a special meaning for Canadians. According to Wikipedia, it is Canada’s largest fast-food restaurant chain, with 4,846 stores in 14 countries as of December 31, 2018. In this article, we will use a dataset of Tim Hortons locations from Kaggle to show how to create interactive visualizations of latitude and longitude on maps with Plotly and Python. From these plots, we can easily answer a question:
Which Tim Hortons is the one Santa Claus usually visits?
Language is ambiguous. Text referring to the same thing can be written in slightly different ways, or even misspelled. Suppose you are trying to join two tables on an address column, and the same location appears in table A as “520 Xavier Ave, California City” but in table B as “520 Xavier Avenue, CA”. How would you handle this issue?
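One possible way to handle it is fuzzy string matching. The sketch below uses `difflib.SequenceMatcher` from Python's standard library to pair each address in table A with its most similar address in table B. The second address pair and the helper name `similarity` are made up for illustration, and this is not necessarily the approach the article goes on to use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


# The first address pair comes from the text; the second is hypothetical.
table_a = ["520 Xavier Ave, California City", "12 Main Street, Springfield"]
table_b = ["12 Main St., Springfield", "520 Xavier Avenue, CA"]

# For each address in table A, pick the most similar address in table B.
matches = {a: max(table_b, key=lambda b: similarity(a, b)) for a in table_a}

for a, b in matches.items():
    print(f"{a!r} -> {b!r} (score {similarity(a, b):.2f})")
```

In practice you would probably also set a minimum score so that an address with no real counterpart is left unmatched instead of being forced onto its nearest neighbor.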
Here are a few more examples of the same content written in different ways:
GPT-3 is OpenAI’s new language generator, which blew my mind with its power in reasoning, summarizing, classification, transformation, and so on. How good is GPT-3 at generating text?
Since it is the holiday season, you might want to write some greeting cards to your colleagues. The problem is that if you copy Christmas wishes found online directly, your coworker could receive the same wish from you and from others.
Are you curious about what wishes GPT-3 would create, wishes that do not yet exist anywhere in the world? …
Detecting outliers is a crucial step in EDA (exploratory data analysis), and sometimes it is itself the goal of a machine learning project. There are outliers in almost any dataset in the world. Catching and understanding outliers can inspire business insights and lead to further research or possible solutions.
How can we apply data visualization to identify outliers? How can we plot with Plotly or Seaborn by writing a few lines of Python code? This article will use a Covid-19 dataset as an example to demonstrate how to catch anomalies among the 50 US states. The answers we can get from outliers include:
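As a minimal sketch of what catching an outlier can look like in code, the snippet below applies the common 1.5 × IQR rule, the same fence that box plots usually draw their whiskers at. The function name `iqr_outliers` and the numbers are hypothetical, and this is only one of many possible outlier rules.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily counts, for illustration only.
daily_counts = [12, 15, 14, 13, 16, 15, 14, 95, 13, 12]
print(iqr_outliers(daily_counts))  # prints [95]
```

Here Q1 is 13 and Q3 is 15, so the fences sit at 10 and 18 and only the value 95 falls outside them.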
Data Scientist: Keep it simple.