Apr 22, 2022

Preparing and Optimizing Categorical Values for Machine Learning

Let's assume that you have a categorical column with these rows:

Row1: A

Row2: B

Row3: C

Row4: D

Row5: E

Row6: F

Row7: G

Row8: H

Row9: B

Row10: H

As you can see, almost every row has a different value. At this point, you should consider grouping values by similarity. If you know that A, B, C, and D are similar and that E, F, G, and H are similar, you can change your rows like this:

Row1: X

Row2: X

Row3: X

Row4: X

Row5: Y

Row6: Y

Row7: Y

Row8: Y

Row9: X

Row10: Y
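The grouping above can be sketched in a few lines of plain Python. This is only a minimal illustration: the `GROUPS` mapping and the `group_values` helper are hypothetical names made up for this example, and the choice of which values belong together has to come from your own knowledge of the data.

```python
# Hypothetical mapping from each original value to its broader group.
GROUPS = {
    "A": "X", "B": "X", "C": "X", "D": "X",
    "E": "Y", "F": "Y", "G": "Y", "H": "Y",
}


def group_values(values, mapping):
    """Replace each value with its group; values missing from the mapping are kept as-is."""
    return [mapping.get(value, value) for value in values]


# The ten rows from the example above.
column = ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"]
grouped = group_values(column, GROUPS)

print(grouped)
print("distinct before:", len(set(column)), "distinct after:", len(set(grouped)))
```

Keeping unmapped values unchanged is a design choice for this sketch; you might prefer to send them to a separate "other" group instead.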

The fewer distinct categorical values there are in a column or in a model's output, the more consistent your model tends to be. When creating a model, there is a "One Hot Max Size" parameter. It applies to categorical features: one-hot encoding is used for every categorical feature whose number of distinct values is less than or equal to the given parameter value. For example, if there are 7 unique categorical values in your prepared data, you can set this parameter to 7.
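To make the threshold rule concrete, here is a rough sketch in plain Python. It does not call any real library; `should_one_hot` and `one_hot_encode` are hypothetical helpers, and the assumption is simply that the parameter compares a feature's distinct-value count against the given limit. What happens to features above the limit depends on the tool you use and is not shown here.

```python
def should_one_hot(values, one_hot_max_size):
    """Assumed rule: one-hot encode only if distinct values <= one_hot_max_size."""
    return len(set(values)) <= one_hot_max_size


def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


original = ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"]  # 8 distinct values
grouped = ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "X", "Y"]   # 2 distinct values

for name, feature in [("original", original), ("grouped", grouped)]:
    if should_one_hot(feature, one_hot_max_size=7):
        categories, rows = one_hot_encode(feature)
        print(name, "-> one-hot with columns", categories)
    else:
        print(name, "-> too many distinct values for one-hot at this limit")
```

With a limit of 7, the original column (8 distinct values) falls outside the rule, while the grouped column (2 distinct values) qualifies, which is one reason grouping similar values first can be useful.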

Experimenting with different parameters and column values can lead to better results for your machine learning models.