Preparing and Optimizing Categorical Values for Machine Learning

Let's assume that you have a categorical column with these rows:

Row1: A

Row2: B

Row3: C

Row4: D

Row5: E

Row6: F

Row7: G

Row8: H

Row9: B

Row10: H

As you can see, almost every row has a different value. At this point, you should consider grouping values by similarity. If you know that A, B, C, D are similar and E, F, G, H are similar, you can change your rows like this:

Row1: X

Row2: X

Row3: X

Row4: X

Row5: Y

Row6: Y

Row7: Y

Row8: Y

Row9: X

Row10: Y
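
The grouping above can be sketched in a few lines of Python. This is only an illustration: the `GROUPS` mapping, the `group_values` helper, and the `"OTHER"` fallback are hypothetical names, and deciding which values are similar remains domain knowledge that you have to supply yourself.

```python
# Hypothetical mapping from original values to groups.
# Which values count as "similar" is an assumption made for this example.
GROUPS = {
    "A": "X", "B": "X", "C": "X", "D": "X",
    "E": "Y", "F": "Y", "G": "Y", "H": "Y",
}


def group_values(values, mapping, default="OTHER"):
    """Replace each value with its group; unmapped values become `default`."""
    return [mapping.get(value, default) for value in values]


rows = ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"]
grouped = group_values(rows, GROUPS)

print(grouped)
print(len(set(rows)), "distinct values before,", len(set(grouped)), "after")
```

Keeping the mapping in one dictionary makes it easy to try a different grouping later and compare the results.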

The fewer distinct categorical values there are in a column or in a model's output, the more consistent your model tends to be. When creating a model, there is a "One Hot Max Size" parameter. It applies to categorical features: one-hot encoding is used for every feature whose number of distinct values is less than or equal to the given parameter value. For example, if there are 7 unique categorical values in your prepared data, you can set this parameter to 7.
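
To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not the implementation behind the "One Hot Max Size" parameter. The function `one_hot_if_small` and its `max_size` argument are hypothetical, and returning `None` only stands in for whatever other encoding a real tool would apply to columns with too many distinct values.

```python
def one_hot_if_small(column, max_size):
    """One-hot encode `column` only if it has at most `max_size` distinct values.

    Returns a dict mapping each category to a list of 0/1 flags,
    or None when the column has more than `max_size` distinct values.
    """
    categories = sorted(set(column))
    if len(categories) > max_size:
        # Assumption: a real tool would switch to a different encoding here.
        return None
    return {c: [1 if value == c else 0 for value in column] for c in categories}


original = ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"]
grouped = ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "X", "Y"]

# 8 distinct values exceed the threshold of 7, so no one-hot encoding.
encoded_original = one_hot_if_small(original, max_size=7)

# 2 distinct values are within the threshold, so the column is one-hot encoded.
encoded_grouped = one_hot_if_small(grouped, max_size=7)

print(encoded_original)
print(encoded_grouped)
```

With a threshold of 7, the original column is left out of one-hot encoding, while the grouped column produces just two indicator columns.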

Experimenting with different parameters and column values can lead to better results for your Machine Learning models.

Let's start with the simplest way.
