How does AI Studio calculate the distance between instances to create clusters?
The AI Studio clustering algorithm groups data points by their proximity to one another. How that proximity is computed depends on the field type.
- For numeric fields, it is measured with the Euclidean distance; the algorithm minimizes the total distance from each data point to its assigned centroid.
- For categorical fields, AI Studio uses a special binary distance (0 or 1) function where:
if valA == valB then
    distance = 0
else
    distance = 1 (or a user-defined scale value)
For the centroid, AI Studio uses the most common category among the cluster's member instances, and then computes the Euclidean distance as usual.
- For text and items fields, AI Studio takes a different approach and bases the distance on cosine similarity. The terms chosen for a centroid are those that minimize the average cosine distance between the centroid and the points in its neighborhood.
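
The three rules above can be illustrated with a minimal Python sketch. This is not AI Studio's actual implementation: the function names, the `categorical_scale` parameter, the choice to square the 0/1 categorical difference into the Euclidean sum, and the use of unweighted (binary) term sets for the cosine distance are all assumptions made for illustration.

```python
import math


def numeric_categorical_distance(a, b, categorical_scale=1.0):
    """Euclidean distance between two records with numeric and categorical fields.

    Assumption: string values are treated as categorical and contribute
    0 when equal, or `categorical_scale` when different.
    """
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            # Binary categorical distance: 0 if equal, scale value otherwise.
            diff = 0.0 if x == y else categorical_scale
        else:
            diff = float(x) - float(y)
        total += diff * diff
    return math.sqrt(total)


def cosine_distance(terms_a, terms_b):
    """Cosine distance (1 - cosine similarity) between two collections of terms.

    Assumption: each collection is treated as a binary term vector, so the
    similarity is |A & B| / sqrt(|A| * |B|).
    """
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a or not set_b:
        # No terms on one side: treat the pair as maximally distant.
        return 1.0
    shared = len(set_a & set_b)
    return 1.0 - shared / math.sqrt(len(set_a) * len(set_b))


if __name__ == "__main__":
    print(numeric_categorical_distance([1.0, "red"], [4.0, "red"]))
    print(numeric_categorical_distance([1.0, "red"], [4.0, "blue"]))
    print(cosine_distance(["cat", "dog"], ["dog", "bird"]))
```

In this sketch, two records with the same category differ only by their numeric fields, while a category mismatch adds a fixed penalty. A larger `categorical_scale` makes category mismatches weigh more heavily against numeric differences.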