Apr 22, 2022
Let's assume that you have a categorical column with these rows:
Row1: A
Row2: B
Row3: C
Row4: D
Row5: E
Row6: F
Row7: G
Row8: H
Row9: B
Row10: H
As you can see, almost every row has a different value. At this point, you should consider grouping values by similarity. If you know that A, B, C, D are similar and E, F, G, H are similar, you can change your rows like this:
Row1: X
Row2: X
Row3: X
Row4: X
Row5: Y
Row6: Y
Row7: Y
Row8: Y
Row9: X
Row10: Y
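The regrouping above can be sketched in a few lines of plain Python. This is only an illustration: the names `rows`, `group_of`, and `grouped` are made up for the example, and the mapping assumes you already know which values belong together.

```python
# The ten example rows from above.
rows = ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"]

# Assumed grouping: A, B, C, D are similar (X); E, F, G, H are similar (Y).
group_of = {value: "X" for value in "ABCD"}
group_of.update({value: "Y" for value in "EFGH"})

# Replace every original value with its group label.
grouped = [group_of[value] for value in rows]

for number, label in enumerate(grouped, start=1):
    print(f"Row{number}: {label}")

# Compare the number of distinct values before and after grouping.
print(len(set(rows)), "distinct values before,", len(set(grouped)), "after")
```

A plain dictionary keeps the grouping explicit, so a value that was left out of the mapping raises a `KeyError` instead of silently passing through.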
The fewer distinct categorical values there are in a column or in a model's output, the more consistent your model tends to be. When creating a model, there is a "One Hot Max Size" parameter. It applies to categorical features: one-hot encoding is used for features whose number of distinct values is less than or equal to the given value. For example, if there are 7 unique categorical values in your prepared data, you can set this parameter to 7.
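To choose a value for this parameter, it helps to count the distinct values in each categorical column first. The sketch below does that in plain Python and lists the columns that would fall at or under a given threshold. The helper names `distinct_counts` and `one_hot_candidates` and the column names are hypothetical, and the "less than or equal" rule is assumed to match the parameter described above (it resembles CatBoost's `one_hot_max_size`, if that is the library in use).

```python
def distinct_counts(columns):
    """Return the number of distinct values in each column."""
    return {name: len(set(values)) for name, values in columns.items()}


def one_hot_candidates(columns, one_hot_max_size):
    """Return the columns whose distinct-value count is <= one_hot_max_size."""
    counts = distinct_counts(columns)
    return [name for name, count in counts.items() if count <= one_hot_max_size]


# Hypothetical data: the original column and its grouped version.
columns = {
    "raw": ["A", "B", "C", "D", "E", "F", "G", "H", "B", "H"],
    "grouped": ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "X", "Y"],
}

counts = distinct_counts(columns)
selected = one_hot_candidates(columns, one_hot_max_size=7)

print(counts)
print(selected)
```

With a threshold of 7, the ungrouped column (8 distinct values) is not selected, while the grouped column (2 distinct values) is.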
Experimenting with different parameters and column values can lead to better results for your Machine Learning models.