Modules

PU learning model

class PUClassifier(cmodel, emodel, da=False)[source]

PU learning classification model under unknown propensity. The model specifies a classification model and a propensity model, and estimates their parameters jointly with the EM algorithm (SAR-EM, Bekker et al.).

Parameters

cmodel (pysarpu.classification.Classifier) – an instance of class pysarpu.classification.Classifier representing the classification model. This package includes two types of classification models: logistic regression (accessible through pysarpu.classification.LinearLogisticRegression) and linear discriminant analysis (accessible through pysarpu.classification.LinearDiscriminantClassifier)
emodel (pysarpu.propensity.Propensity) – an instance of class pysarpu.propensity.Propensity representing the propensity model. This package includes multiple pre-implemented propensity models: logistic propensity (pysarpu.propensity.LogisticPropensity), log-normal propensity (pysarpu.propensity.LogProbitPropensity) and Gumbel propensity (pysarpu.propensity.GumbelPropensity)
da (bool, optional) – whether the classification model is a discriminant analysis type model (True) or not (False). The likelihood being maximized differs between these two settings. Default: False.

Returns

Return an instance of PU learning model (not yet initialized).

Return type

pysarpu.PUClassifier

initialization(Xc, Xe, Y, w=1.0)[source]

Initialization of the parameters of both the classification and propensity models before running the EM algorithm. The parameters of each model are initialized following its own method: see the initialization methods of cmodel and emodel.

Parameters

Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification. The parameters of cmodel will be initialized in agreement with the dimension of the entry data \(d_1\).
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity. The parameters of emodel will be initialized in agreement with the dimension of the entry data \(d_2\).
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.

Returns

None

e(Xe)[source]

Propensity function using the current parameters of propensity model emodel.

Parameters: Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
Returns: vector of propensity scores.
Return type: numpy.array of size \(n\)

loge(Xe)[source]

Logarithm of propensity function using the current parameters of propensity model emodel.

Parameters: Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
Returns: vector of log-propensity scores.
Return type: numpy.array of size \(n\)

predict_cproba(Xc)[source]

Class probability predictions using the parameters of the classification model.

Parameters: Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Returns: posterior class probabilities.
Return type: numpy.array vector of size \(n\)

predict_clogproba(Xc)[source]

Class log-probability predictions using the parameters of the classification model.

Parameters: Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Returns: posterior class log-probabilities.
Return type: numpy.array vector of size \(n\)

predict_c(Xc, threshold=0.5)[source]

Class binary predictions using the parameters of the classification model.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.

Returns

class predictions.

Return type

numpy.array binary vector of size \(n\)
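
The decision rule can be pictured with a minimal NumPy sketch. The array `proba` is a hypothetical stand-in for the output of predict_cproba(Xc), and treating a probability exactly equal to the threshold as positive is an assumption, not something stated in this documentation.

```python
import numpy as np

# Hypothetical class probabilities, standing in for predict_cproba(Xc).
proba = np.array([0.2, 0.5, 0.7])
threshold = 0.5

# Assumption: ties at the threshold are assigned to the positive class.
pred = (proba >= threshold).astype(int)
```

Raising the threshold makes the rule more conservative about predicting the positive class.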

predict_proba(Xc, Xe)[source]

Label probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_cproba which returns class probabilities instead.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.

Returns

posterior label probabilities.

Return type

numpy.array vector of size \(n\)
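
The difference between class and label probabilities can be illustrated with a short NumPy sketch. It assumes the usual SAR decomposition, in which an instance is labeled only if it is positive and then selected, so that the label probability is the product of the class probability and the propensity. The arrays `eta` and `e` are hypothetical stand-ins for predict_cproba(Xc) and e(Xe); this is an illustration, not the package's implementation.

```python
import numpy as np

# Hypothetical stand-ins for predict_cproba(Xc) and e(Xe).
eta = np.array([0.9, 0.5, 0.1])  # P(Z=1 | x): class probabilities
e = np.array([0.5, 0.2, 0.8])    # P(Y=1 | Z=1, x): propensity scores

# Under the SAR assumption, a label is observed only for selected positives.
label_proba = eta * e
# The same quantity on the log scale, as a sum of log-probabilities.
label_logproba = np.log(eta) + np.log(e)
```

A label probability never exceeds the corresponding class probability, which is why thresholding the two quantities generally gives different predictions.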

predict_logproba(Xc, Xe)[source]

Label log-probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_clogproba which returns class log-probabilities instead.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.

Returns

posterior label log-probabilities.

Return type

numpy.array vector of size \(n\)

predict(Xc, Xe, threshold=0.5)[source]

Label binary predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_c which returns class predictions instead.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.

Returns

label binary predictions.

Return type

numpy.array binary vector of size \(n\)

loglikelihood(Xc, Xe, Y, w=1.0)[source]

Log-likelihood function given the current parameters of the classification and propensity models. Note that the function returns the mean of the individual log-likelihoods (instead of the usual sum).

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.

Returns

log-likelihood.

Return type

float
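
As a rough illustration of the mean log-likelihood, the sketch below covers the non-discriminant case (da=False) only, assuming each observed label is Bernoulli with success probability equal to the class probability times the propensity. The function name `mean_loglikelihood`, the clipping constant and the way weights are applied are all assumptions made for the example.

```python
import numpy as np

def mean_loglikelihood(eta, e, Y, w=1.0):
    """Mean Bernoulli log-likelihood of observed labels (hypothetical helper)."""
    # Label probability under SAR; clipped to keep the logarithms finite.
    p_label = np.clip(eta * e, 1e-12, 1 - 1e-12)
    ll = Y * np.log(p_label) + (1 - Y) * np.log1p(-p_label)
    # Mean rather than sum, as described above; the weighting is an assumption.
    return float(np.mean(w * ll))

eta = np.array([0.8, 0.4])  # hypothetical class probabilities
e = np.array([0.5, 0.5])    # hypothetical propensity scores
Y = np.array([1, 0])        # observed labels
value = mean_loglikelihood(eta, e, Y)
```

Returning a mean keeps the value on a comparable scale across data sets of different sizes, which is convenient when a fixed tolerance is used as a stopping criterion.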

expectation(Xc, Xe, Y)[source]

Compute the expectation step of the EM algorithm: return, for every instance, the probability of belonging to the positive class given the observed labels.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.

Returns

posterior probabilities

Return type

numpy.array vector of size \(n\)
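
The expectation step can be sketched as follows, assuming the standard SAR-EM posterior: a labeled instance is positive with certainty, while for an unlabeled instance Bayes' rule gives \(\eta(1-e)/(1-\eta e)\). The function name `e_step` and the input arrays are hypothetical, and the sketch assumes \(\eta e < 1\) so that the denominator is nonzero.

```python
import numpy as np

def e_step(eta, e, Y):
    """Posterior P(Z=1 | Y, x) under the SAR model (hypothetical helper)."""
    # Unlabeled case: P(Z=1 | Y=0, x) = eta (1 - e) / (1 - eta e).
    unlabeled = eta * (1 - e) / (1 - eta * e)
    # Labeled instances are positive with probability one.
    return np.where(Y == 1, 1.0, unlabeled)

eta = np.array([0.8, 0.5])  # hypothetical class probabilities
e = np.array([0.5, 0.5])    # hypothetical propensity scores
Y = np.array([1, 0])        # observed labels
gamma = e_step(eta, e, Y)
```

The resulting vector plays the role of the gamma argument expected by the maximisation step.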

maximisation(Xc, Xe, Y, gamma, w=1.0, warm_start=True, balance=False)[source]

Compute the maximisation step of the EM algorithm: update the parameters of both the classification and propensity models.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.

Returns

None

fit(Xc, Xe, Y, w=1.0, tol=1e-06, max_iter=10000.0, warm_start=False, balance=False, n_init=20, iter_init=20)[source]

Estimation of the PU learning model parameters (classifier and propensity) through the EM algorithm. Multiple random initializations are considered and each is trained over a few iterations. Then, only the one achieving the best log-likelihood is kept and trained until convergence.

Parameters

Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
tol (float, optional) – tolerance parameter. Once the increase in the log-likelihood is below tol, the algorithm stops (default 1e-6).
max_iter (int, optional) – maximum number of iterations (default: 1e4)
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (default False).
balance (bool, optional) – re-balance weights when fitting the propensity model in the maximization (experimental, potentially interesting in highly unbalanced situations). Default: False.
n_init (int, optional) – number of initializations to consider in the Small EM initialization strategy (default: n_init=20)
iter_init (int, optional) – maximum number of iterations to consider for each initialization (default: 20).

Returns

None
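
The Small EM strategy described above can be sketched with a generic driver. The callbacks `init`, `em_step` and `loglik` are hypothetical placeholders, and the toy objective used to exercise the driver is a simple concave function with a monotone update, not an actual EM problem; the package's own loop may differ in its details.

```python
import numpy as np

def small_em(init, em_step, loglik, n_init=20, iter_init=20,
             tol=1e-6, max_iter=10000, seed=0):
    """Short runs from several random starts, then a long run from the best."""
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_init):
        params = init(rng)
        for _ in range(iter_init):
            params = em_step(params)
        ll = loglik(params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    # Continue from the best start until the log-likelihood gain is below tol.
    params, ll = best_params, best_ll
    for _ in range(int(max_iter)):
        params = em_step(params)
        new_ll = loglik(params)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return params, ll

# Toy stand-in: each update halves the distance to the maximizer at 3.
params, ll = small_em(
    init=lambda rng: rng.normal(scale=5.0),
    em_step=lambda p: p + 0.5 * (3.0 - p),
    loglik=lambda p: -(p - 3.0) ** 2,
)
```

Running several short trajectories before committing to one is a common way to reduce the sensitivity of EM to its starting point.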

save(path)[source]

Save the PU learning model with its current parameters as a binary file (relies on the pickle library).

Parameters: path (str) – path at which the model should be saved.
Returns: None
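
Since saving relies on pickle, a saved model can presumably be read back with pickle.load. The round trip below uses a plain dictionary as a hypothetical stand-in for a fitted model; the file name and the dictionary keys are made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
model = {"cmodel_params": [0.1, -0.3], "emodel_params": [1.2]}

path = os.path.join(tempfile.mkdtemp(), "pu_model.pkl")

# Write the object as a binary pickle file.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back; only unpickle files from a trusted source.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Unpickling a class instance requires the defining package to be importable in the loading environment.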

Classification models

Two classification models can be found in submodule pysarpu.classification:

a Linear Logistic Regression model
a Linear Discriminant Analysis model

These two models inherit from the general class pysarpu.classification.Classifier.

class LinearLogisticRegression[source]

Linear logistic regression model for classification.

Parameters: params (numpy.array vector of size \(d_1+1\)) – current parameter vector.

initialization(Xc, w=1.0)[source]

Initialization of the parameters of the model. Initial parameters are chosen randomly and the dimension of parameter vector is the dimension of the covariates + 1 (intercept).

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: None

fit(Xc, gamma, w=1.0, warm_start=True)[source]

Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary: it may consist of probability values.

Parameters

Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (False). Default: True.

Returns

None

eta(Xc)[source]

Class probability predictions given the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: class probabilities.
Return type: numpy.array vector of size \(n\)

logeta(Xc)[source]

Class log-probability predictions given the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: class log-probabilities.
Return type: numpy.array vector of size \(n\)
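
For a linear logistic model, the two prediction methods amount to a sigmoid and a log-sigmoid of a linear score. The sketch below assumes the intercept is stored as the first entry of the parameter vector of size \(d_1+1\); that ordering is an assumption, and the functions are illustrations rather than the package's code.

```python
import numpy as np

def eta(Xc, params):
    """Class probabilities for a linear logistic model."""
    z = params[0] + Xc @ params[1:]  # assumption: intercept stored first
    return 1.0 / (1.0 + np.exp(-z))

def logeta(Xc, params):
    """Class log-probabilities, computed without forming eta explicitly."""
    z = params[0] + Xc @ params[1:]
    # log(sigmoid(z)) = -log(1 + exp(-z)), evaluated stably with logaddexp.
    return -np.logaddexp(0.0, -z)

Xc = np.array([[0.0, 0.0], [1.0, 0.0]])
params = np.array([0.5, 1.0, -0.5])  # hypothetical parameter vector
probs = eta(Xc, params)
logprobs = logeta(Xc, params)
```

Computing the log-probability directly avoids taking the logarithm of a probability that has underflowed to zero for very negative scores.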

class LinearDiscriminantClassifier[source]

Linear Discriminant Analysis model for classification.

Parameters: params (dict) – current parameters: pi is the class prior, mu_0 the mean vector for negative class, mu_1 the mean vector for positive class, Sigma the covariance matrix.

initialization(Xc, w=1.0)[source]

Initialization of the parameters of the model:

the class prior pi is randomly and uniformly drawn in \([0,1]\)
the mean vectors mu_0 and mu_1 are drawn as standard Gaussian variables
the covariance matrix Sigma is initialized as the empirical covariance matrix of the whole data set.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
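
The three initialization rules listed above can be sketched with NumPy. The function name `init_lda` and the seed argument are hypothetical, and np.cov normalizes by \(n-1\) by default, which may differ from the normalization used by the package.

```python
import numpy as np

def init_lda(Xc, seed=0):
    """Random initial parameters for an LDA model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d1 = Xc.shape[1]
    return {
        "pi": rng.uniform(0.0, 1.0),       # class prior, uniform on [0, 1]
        "mu_0": rng.standard_normal(d1),   # negative-class mean
        "mu_1": rng.standard_normal(d1),   # positive-class mean
        # Empirical covariance of the whole data set (columns are variables).
        "Sigma": np.atleast_2d(np.cov(Xc, rowvar=False)),
    }

Xc = np.random.default_rng(1).normal(size=(50, 3))
params = init_lda(Xc)
```

Only the covariance depends on the data here; the prior and the means are drawn independently of Xc.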

fit(Xc, gamma, w=1.0, warm_start=True)[source]

Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary: it may consist of probability values.

Parameters

Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (False). Default: True. It has little importance here, as the maximization is straightforward and does not depend on the initialization.

Returns

None
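
To see why the initialization does not matter here, consider the textbook weighted maximum-likelihood update for two Gaussian classes sharing a covariance matrix: every parameter is a weighted average computed directly from the data and gamma. The sketch below shows that update; the function name `lda_m_step` is hypothetical, the weights w are ignored, and the package's exact formulas may differ.

```python
import numpy as np

def lda_m_step(Xc, gamma):
    """Weighted estimates of pi, mu_0, mu_1 and a shared Sigma."""
    n = Xc.shape[0]
    n1 = gamma.sum()          # soft count of positives
    n0 = n - n1               # soft count of negatives
    mu_1 = gamma @ Xc / n1
    mu_0 = (1 - gamma) @ Xc / n0
    R1, R0 = Xc - mu_1, Xc - mu_0
    # Pooled covariance: residuals weighted by class membership.
    Sigma = ((gamma[:, None] * R1).T @ R1
             + ((1 - gamma)[:, None] * R0).T @ R0) / n
    return {"pi": n1 / n, "mu_0": mu_0, "mu_1": mu_1, "Sigma": Sigma}

Xc = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
gamma = np.array([0.0, 0.0, 1.0, 1.0])  # hypothetical posterior probabilities
params = lda_m_step(Xc, gamma)
```

With probabilities strictly between 0 and 1, every instance contributes to both class means in proportion to gamma and 1 - gamma.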

eta(Xc)[source]

Class probability predictions given the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: class probabilities.
Return type: numpy.array vector of size \(n\)

logeta(Xc)[source]

Class log-probability predictions given the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: class log-probabilities.
Return type: numpy.array vector of size \(n\)

pdf_pos(Xc)[source]

Individual likelihood for the positive distribution \(\mathbb{P}(x \vert Z=1)\) and for the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: individual likelihood values.
Return type: numpy.array vector of size \(n\)

pdf_neg(Xc)[source]

Individual likelihood for the negative distribution \(\mathbb{P}(x \vert Z=0)\) and for the current parameters.

Parameters: Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
Returns: individual likelihood values.
Return type: numpy.array vector of size \(n\)
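
The two class-conditional densities combine into class probabilities through Bayes' rule. The sketch below assumes Gaussian densities with the shared covariance Sigma and the parameter names listed for this class; `gaussian_pdf` and `lda_eta` are hypothetical helpers, not the package's functions.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    R = X - mu
    # Squared Mahalanobis distance of each row, via a linear solve.
    maha = np.sum(R * np.linalg.solve(Sigma, R.T).T, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def lda_eta(X, params):
    """P(Z=1 | x) from the prior and the two class densities."""
    pos = params["pi"] * gaussian_pdf(X, params["mu_1"], params["Sigma"])
    neg = (1 - params["pi"]) * gaussian_pdf(X, params["mu_0"], params["Sigma"])
    return pos / (pos + neg)

params = {"pi": 0.5, "mu_0": np.array([-1.0, 0.0]),
          "mu_1": np.array([1.0, 0.0]), "Sigma": np.eye(2)}
X = np.array([[0.0, 0.0], [1.0, 0.0]])
probs = lda_eta(X, params)
```

A point equidistant from both means with a prior of 0.5 gets probability 0.5, and the probability increases as the point moves toward mu_1.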

Propensity models

Three propensity models are provided in submodule pysarpu.propensity:

a Logistic Regression model
a logistic function with log-normal link function
a logistic function with Weibull link function

class LogisticPropensity[source]: Logistic Propensity : the feature vector Xe is assumed to contain an intercept as its first column

class LogProbitPropensity[source]

class GumbelPropensity[source]