image by author

The family of linear models includes ordinary linear regression, Ridge regression, Lasso regression, SGD regression, and so on. The coefficients of linear models are commonly interpreted as the feature importance of the corresponding variables. In general, feature importance refers to how useful a feature is at predicting a target variable: for example, how useful age_of_a_house is at predicting house price.

This article summarizes and explains four often neglected but crucial pitfalls in using the coefficients of linear models as feature importance:

  • Standardized dataset or not
  • Linear models have different opinions
  • Curse of highly correlated features
  • Stability check with cross-validation
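The first pitfall can be illustrated with a minimal sketch. It assumes a single-feature least-squares fit and a handful of made-up house prices (illustrative numbers only, not taken from any dataset in the article). The helper names `ols_slope` and `standardize` are hypothetical. The raw coefficient changes when the same feature is expressed in different units, while the coefficient on the standardized feature does not.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x for a single feature."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Made-up illustrative data: living area and price (in thousands of dollars).
area_sqft = [800, 1000, 1200, 1500, 2000]
price = [200, 240, 290, 350, 460]

# The same feature expressed in square metres instead of square feet.
SQM_PER_SQFT = 0.092903
area_sqm = [v * SQM_PER_SQFT for v in area_sqft]

# Raw coefficients depend on the unit of the feature.
slope_sqft = ols_slope(area_sqft, price)
slope_sqm = ols_slope(area_sqm, price)

# After standardization the coefficient no longer depends on the unit.
slope_std_sqft = ols_slope(standardize(area_sqft), price)
slope_std_sqm = ols_slope(standardize(area_sqm), price)

print(f"raw slope (sqft): {slope_sqft:.4f}")
print(f"raw slope (sqm):  {slope_sqm:.4f}")
print(f"standardized slope (sqft): {slope_std_sqft:.4f}")
print(f"standardized slope (sqm):  {slope_std_sqm:.4f}")
```

With several features on different scales, the same effect makes raw coefficients hard to compare with each other, which is why standardizing first is usually suggested before reading coefficients as importance.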

Linear regression models…

An easy way to create interactive geographical scatter plots

All Images in the Article Created by Author

“Where Tim Hortons is, that’s my home.” Tim Hortons, a Starbucks-like coffee shop founded by a hockey player, has always held a special meaning for Canadians. According to Wikipedia, it is Canada’s largest fast-food restaurant chain, with 4,846 stores in 14 countries as of December 31, 2018. In this article, we will use a dataset of Tim Hortons locations from Kaggle to show how to create interactive visualizations of latitude and longitude on maps with Plotly and Python. From these plots, we can easily answer a question:

Which Tim Hortons is the one Santa Claus usually visits?

Import libraries:

Fuzzycouple: A solution for fuzzy match using tf-idf and cosine similarity

All Images in this article created by Author

Why we need fuzzy string match and what are the use cases?

Languages are ambiguous. Text referring to the same thing can be written slightly differently, or even misspelled. Suppose you are trying to join two tables on an address column, and the same location appears as “520 Xavier Ave, California City” in table A but as “520 Xavier Avenue, CA” in table B. How would you handle this issue?

Here are a few more examples of the same thing written in different ways:

  • The Queen’s Gambit vs Netflix The Queen’s Gambit (miniseries)
  • Toronto Raptors vs Raptors
  • Los Angeles Lakers vs Lakers
  • helloworld@gmail.com vs helloworld@gmail.con
  • Tesla, Inc. vs TSLA
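To make the idea in the title concrete, here is a minimal pure-Python sketch of fuzzy matching with tf-idf and cosine similarity. It is not the Fuzzycouple implementation: the character-trigram tokenization, the smoothed IDF formula, and the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`, `best_match`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split lowercased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Return one L2-normalized tf-idf vector (as a dict) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF: n-grams that occur in few documents get more weight.
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["Toronto Raptors", "Raptors", "Los Angeles Lakers", "Lakers"]
vectors = tfidf_vectors(names)


def best_match(query):
    """Return the most similar other name in `names`."""
    i = names.index(query)
    scores = [
        (cosine(vectors[i], vectors[j]), names[j])
        for j in range(len(names))
        if j != i
    ]
    return max(scores)[1]


for name in names:
    print(f"{name!r} -> {best_match(name)!r}")
```

Character n-grams tolerate misspellings and abbreviations better than whole-word tokens, which is why they are a common choice for this kind of matching. Pairs such as "Tesla, Inc." and "TSLA" share very few n-grams, so they would likely need a different approach.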

We want…

An experiment of GPT-3 in Generating Content

All images in this article created by Author

GPT-3 is OpenAI’s new language generator, which blew my mind with its power in reasoning, summarization, classification, transformation, and so on. How capable is GPT-3 at generating text?

Since it is the holiday season, you might want to write some greeting cards to your colleagues. The problem is that if you copy Christmas wishes found online directly, your coworker could receive the same wish from you and from others.

Are you curious about what wishes GPT-3 would create, wishes that don’t yet exist anywhere in the world? …

Plot COVID-19 Data Using Plotly and Seaborn to Catch Outliers

Detecting outliers is a crucial step in EDA (exploratory data analysis), and sometimes is itself the goal of a machine learning project. Almost any dataset contains outliers. Catching and understanding outliers can inspire business insights and lead to further research or possible solutions.
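As a rough sketch of what "catching" an outlier can mean numerically, the snippet below applies the 1.5 × IQR rule that box plots conventionally use to draw their whiskers. The counts are made-up illustrative values, not COVID-19 data, and the helper name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up illustrative counts; one value is clearly out of line with the rest.
daily_counts = [12, 15, 14, 13, 16, 15, 14, 95, 13, 12]

print(iqr_outliers(daily_counts))
```

A plot makes the same point visually: the flagged value is the one that would be drawn as an isolated point beyond the whisker.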

How can data visualization be applied to identify outliers? How can you plot with Plotly or Seaborn in a few lines of Python code? This article uses a COVID-19 dataset as an example to demonstrate how to catch anomalies across the 50 US states. The answers we can get from outliers include:

Alina Zhang

Data Scientist: Keep it simple.
