Improve binning in binarize() #44

Description

@cthoyt

The current binarize function uses a cutoff of 0.5 for binarization:

rexmex/rexmex/utils.py

Lines 28 to 34 in 3e26652

def metric_wrapper(*args, **kwargs):
    # TODO: Move to optimal binning. Youden’s J statistic.
    y_score = args[1]
    y_score[y_score < 0.5] = 0
    y_score[y_score >= 0.5] = 1
    score = metric(*args, **kwargs)
    return score

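The fixed cutoff is a problem for PyKEEN, where the scores that come from a model need not be probabilities and could all lie in a range such as [-5, -2]. With a cutoff of 0.5, every such score would be binarized to 0. The current TODO text says to use Youden's J statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic), but it's not clear how that would be used.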
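One possible reading of the TODO, as a rough sketch rather than a proposal for the actual rexmex API: treat every distinct score as a candidate cutoff, compute J = TPR - FPR against the true labels, and binarize at the cutoff with the largest J. The helper names `youden_threshold` and `binarize_scores` are hypothetical, and the sketch assumes binary labels are available alongside the scores. It also returns a new array instead of overwriting `y_score` in place.

```python
import numpy as np


def youden_threshold(y_true, y_score):
    """Return the cutoff that maximizes Youden's J = TPR - FPR (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    best_j, best_t = -np.inf, None
    # Each distinct score is a candidate cutoff; a sample is predicted
    # positive when its score is greater than or equal to the cutoff.
    for t in np.unique(y_score):
        pred = y_score >= t
        tpr = np.sum(pred & y_true) / n_pos
        fpr = np.sum(pred & ~y_true) / n_neg
        j = tpr - fpr
        if j > best_j:
            best_j, best_t = j, t
    return float(best_t)


def binarize_scores(y_true, y_score):
    """Binarize into a new array, leaving the caller's y_score untouched."""
    threshold = youden_threshold(y_true, y_score)
    return (np.asarray(y_score) >= threshold).astype(int)
```

Because the cutoff is derived from the scores themselves, this works for scores on any scale, including ones that are all negative. The trade-off is that the threshold is fitted on the same labels the metric is then evaluated against, which may be optimistic.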

As an alternative, the NetMF package implements the following code for constructing an indicator that might be more applicable (though I don't personally recognize what method this is, and unfortunately it's not documented):

def construct_indicator(y_score, y):
    # rank the labels by the scores directly
    num_label = np.sum(y, axis=1, dtype=np.int)
    y_sort = np.fliplr(np.argsort(y_score, axis=1))
    y_pred = np.zeros_like(y, dtype=np.int)
    for i in range(y.shape[0]):
        for j in range(num_label[i]):
            y_pred[i, y_sort[i, j]] = 1
    return y_pred
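As far as I can tell from reading it, this function does a per-row top-k selection: for each row it counts the true labels (k) and marks the k highest-scoring columns as predicted positives, so no fixed cutoff is needed. Below is a sketch of the same logic that should run on current NumPy, where the `np.int` alias (removed in NumPy 1.24) is replaced by the built-in `int`. The name `top_k_indicator` is hypothetical, and both inputs are assumed to be 2D arrays of shape (samples, labels).

```python
import numpy as np


def top_k_indicator(y_score, y):
    """Mark, per row, the k highest-scoring columns, where k is that row's
    number of true labels (hypothetical helper)."""
    y = np.asarray(y)
    y_score = np.asarray(y_score)
    # Number of true labels in each row
    num_label = y.sum(axis=1).astype(int)
    # Column indices sorted by descending score within each row
    order = np.argsort(y_score, axis=1)[:, ::-1]
    y_pred = np.zeros_like(y, dtype=int)
    for i, k in enumerate(num_label):
        y_pred[i, order[i, :k]] = 1
    return y_pred
```

Since only the ranking of the scores matters, this is insensitive to the scale of the scores, but it does require knowing the number of true labels per row at prediction time.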
