Modules
PU learning model
- class PUClassifier(cmodel, emodel, da=False)[source]
PU learning classification model under unknown propensity. The model specifies one model for the classification and one for the propensity, and estimates the parameters of both with the EM algorithm (SAR-EM, Bekker et al.).
- Parameters
cmodel (pysarpu.classification.Classifier) – an instance of class pysarpu.classification.Classifier representing the classification model. This package includes two types of classification models: logistic regression (accessible through pysarpu.classification.LinearLogisticRegression) and linear discriminant analysis (accessible through pysarpu.classification.LinearDiscriminantClassifier).
emodel (pysarpu.propensity.Propensity) – an instance of class pysarpu.propensity.Propensity representing the propensity model. This package includes multiple pre-implemented propensity models: logistic propensity (pysarpu.propensity.LogisticPropensity), log-normal propensity (pysarpu.propensity.LogProbitPropensity) and Gumbel propensity (pysarpu.propensity.GumbelPropensity).
da (bool, optional) – whether the classification model is a discriminant analysis type model (True) or not (False). Indeed, the likelihood maximized is not the same in these two settings. Default: False.
- Returns
Return an instance of PU learning model (not yet initialized).
- Return type
- initialization(Xc, Xe, Y, w=1.0)[source]
Initialization of the parameters of both the classification and propensity models before running the EM algorithm. The parameters of each model are initialized following its own method: see the initialization methods of cmodel and emodel.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification. The parameters of cmodel will be initialized in agreement with the dimension of the entry data \(d_1\).
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity. The parameters of emodel will be initialized in agreement with the dimension of the entry data \(d_2\).
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
- Returns
None
- e(Xe)[source]
Propensity function using the current parameters of propensity model emodel.
- Parameters
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
- Returns
vector of propensity scores.
- Return type
numpy.array of size \(n\)
- loge(Xe)[source]
Logarithm of propensity function using the current parameters of propensity model emodel.
- Parameters
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
- Returns
vector of log-propensity scores.
- Return type
numpy.array of size \(n\)
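To illustrate how e and loge relate, the sketch below assumes a logistic propensity \(e(x) = \sigma(\beta_0 + \beta^\top x)\). The function names and the coefficient layout (intercept first) are hypothetical and are not taken from pysarpu. Plain Python lists stand in for numpy arrays to keep the example dependency-free. The log-propensity is computed directly rather than as the log of the propensity, which avoids underflow for very negative scores.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def logistic_loge(Xe, beta):
    """Log-propensity per row of Xe; beta[0] is the intercept (assumed layout)."""
    return [
        log_sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
        for row in Xe
    ]

def logistic_e(Xe, beta):
    """Propensity per row of Xe, obtained by exponentiating the log-propensity."""
    return [math.exp(v) for v in logistic_loge(Xe, beta)]

Xe = [[0.0, 0.0], [1.0, -2.0], [-800.0, 0.0]]
beta = [0.0, 1.0, 0.5]
scores = logistic_e(Xe, beta)
log_scores = logistic_loge(Xe, beta)
```

In the last row the propensity underflows to zero while the log-propensity stays finite, which is the usual reason for exposing both quantities.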
- predict_cproba(Xc)[source]
Class probability predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
- Returns
posterior class probabilities.
- Return type
numpy.array vector of size \(n\)
- predict_clogproba(Xc)[source]
Class log-probability predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
- Returns
posterior class log-probabilities.
- Return type
numpy.array vector of size \(n\)
- predict_c(Xc, threshold=0.5)[source]
Class binary predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.
- Returns
class predictions.
- Return type
numpy.array binary vector of size \(n\)
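The decision rule behind threshold can be sketched as follows. The helper name is hypothetical, and whether a probability exactly equal to the threshold counts as positive is an assumption made here, not something this page specifies.

```python
def predict_from_proba(proba, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in proba]

proba = [0.1, 0.45, 0.8]
default_pred = predict_from_proba(proba)        # threshold 0.5
strict_pred = predict_from_proba(proba, 0.9)    # fewer positives
lenient_pred = predict_from_proba(proba, 0.4)   # more positives
```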
- predict_proba(Xc, Xe)[source]
Label probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_cproba which returns class probabilities instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
- Returns
posterior label probabilities.
- Return type
numpy.array vector of size \(n\)
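The difference between class and label probabilities can be made concrete. Under the SAR assumption (only positive instances can be labeled), the label probability is the class probability multiplied by the propensity. The sketch below is a plain-Python illustration of that product; the helper name is hypothetical and the inputs stand for the outputs of predict_cproba and e.

```python
def label_proba(class_proba, propensity):
    """P(Y=1 | x) = P(Z=1 | x) * e(x) under the SAR assumption."""
    return [c * e for c, e in zip(class_proba, propensity)]

class_proba = [0.9, 0.9, 0.2]   # probability of being positive
propensity = [0.5, 0.1, 0.5]    # probability of being labeled when positive
p_label = label_proba(class_proba, propensity)
```

The first two instances are equally likely to be positive, yet the second is much less likely to carry a label because its propensity is low.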
- predict_logproba(Xc, Xe)[source]
Label log-probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_clogproba which returns class log-probabilities instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
- Returns
posterior label log-probabilities.
- Return type
numpy.array vector of size \(n\)
- predict(Xc, Xe, threshold=0.5)[source]
Label binary predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_c which returns class predictions instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.
- Returns
label binary predictions.
- Return type
numpy.array binary vector of size \(n\)
- loglikelihood(Xc, Xe, Y, w=1.0)[source]
Log-likelihood function given the current parameters of the classification and propensity models. Note that the function returns the mean of the individual log-likelihoods (instead of the usual sum).
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
- Returns
log-likelihood.
- Return type
float
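As a rough sketch of what a mean log-likelihood of the observed labels looks like, the example below uses the label probability \(p_i = \eta(x_i)\,e(x_i)\) and averages \(y_i \log p_i + (1-y_i)\log(1-p_i)\). This corresponds to the non-discriminant setting (da=False); the discriminant analysis likelihood differs and is not shown. The helper name and the way weights are applied (multiplying each term, then dividing by \(n\)) are assumptions, not pysarpu's documented behaviour.

```python
import math

def mean_loglikelihood(class_proba, propensity, Y, w=1.0):
    """Mean log-likelihood of observed labels Y given class and propensity scores."""
    n = len(Y)
    weights = [w] * n if isinstance(w, (int, float)) else list(w)
    total = 0.0
    for c, e, y, wi in zip(class_proba, propensity, Y, weights):
        p = c * e  # probability of observing a label
        total += wi * (math.log(p) if y == 1 else math.log1p(-p))
    return total / n  # mean rather than sum

ll = mean_loglikelihood([0.8, 0.5], [0.5, 0.5], [1, 0])
```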
- expectation(Xc, Xe, Y)[source]
Compute the expectation step of the EM algorithm: return, for every instance, the probability of belonging to the positive class given the observed labels.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
- Returns
posterior probabilities
- Return type
numpy.array vector of size \(n\)
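The expectation step of SAR-EM has a simple closed form, sketched below in plain Python. A labeled instance is positive with probability 1; for an unlabeled instance the posterior is \(\eta(x)(1-e(x)) / (1-\eta(x)e(x))\). The helper name is hypothetical and the formula is the standard one under the SAR assumption, not a transcription of the pysarpu source.

```python
def expectation_step(class_proba, propensity, Y):
    """Posterior probability of the positive class given the observed label."""
    gamma = []
    for c, e, y in zip(class_proba, propensity, Y):
        if y == 1:
            gamma.append(1.0)  # labeled instances are positive for sure
        else:
            # positive but unlabeled, relative to all ways of being unlabeled
            gamma.append(c * (1.0 - e) / (1.0 - c * e))
    return gamma

gamma = expectation_step([0.8, 0.8, 0.8], [0.5, 0.5, 0.9], [1, 0, 0])
```

The third instance has a high propensity, so its being unlabeled is stronger evidence that it is negative, and its posterior drops accordingly.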
- maximisation(Xc, Xe, Y, gamma, w=1.0, warm_start=True, balance=False)[source]
Compute the maximisation step of the EM algorithm: update the parameters of both the classification and propensity models.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
- Returns
None
- fit(Xc, Xe, Y, w=1.0, tol=1e-06, max_iter=10000.0, warm_start=False, balance=False, n_init=20, iter_init=20)[source]
Estimation of the PU learning model parameters (classifier and propensity) through the EM algorithm. Multiple random initializations are considered and each is trained over a few iterations. Then only the one achieving the best log-likelihood is kept and trained until convergence.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
tol (float, optional) – tolerance parameter. Once the increase in the log-likelihood is below tol, the algorithm stops (default 1e-6).
max_iter (int, optional) – maximum number of iterations (default: 1e4)
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (default False).
balance (bool, optional) – re-balance weights when fitting the propensity model in the maximization (experimental, potentially interesting in highly unbalanced situations). Default: False.
n_init (int, optional) – number of initializations to consider in the Small EM initialization strategy (default: n_init=20)
iter_init (int, optional) – maximum number of iterations to consider for each initialization (default: 20).
- Returns
None
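The Small EM strategy described for fit can be sketched generically: run n_init short trials of at most iter_init updates, keep the best one, then iterate until the log-likelihood gain falls below tol. In the sketch below the driver and its callbacks (init, step, loglik) are hypothetical names, and a toy one-dimensional objective with two local maxima stands in for the actual EM updates, purely to show why several starting points help.

```python
import random

def small_em(init, step, loglik, n_init=20, iter_init=20, tol=1e-6, max_iter=10000):
    """Run n_init short trials, then refine the best one until the gain is below tol."""
    best_params, best_ll = None, float("-inf")
    for _ in range(n_init):
        params = init()                      # fresh random starting point
        for _ in range(iter_init):           # short run of iter_init updates
            params = step(params)
        ll = loglik(params)
        if ll > best_ll:
            best_params, best_ll = params, ll

    params, ll = best_params, best_ll
    for _ in range(int(max_iter)):           # long run from the best short run
        new_params = step(params)
        new_ll = loglik(new_params)
        gain = new_ll - ll
        params, ll = new_params, new_ll
        if gain < tol:
            break
    return params, ll

# Toy stand-in for EM: an objective with local maxima near -1 and near +1.
def toy_loglik(t):
    return -(t * t - 1.0) ** 2 + 0.3 * t

def toy_step(t, lr=0.05):
    grad = -4.0 * t * (t * t - 1.0) + 0.3
    return t + lr * grad                     # one gradient-ascent update

rng = random.Random(0)
theta, ll = small_em(lambda: rng.uniform(-2.0, 2.0), toy_step, toy_loglik)
```

A single start in the wrong basin would end near the lower maximum; comparing several short runs first makes the final long run start from the better basin.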
Classification models
Two classification models can be found in submodule pysarpu.classification:
a Linear Logistic Regression model
a Linear Discriminant Analysis model
These two models inherit from the general class pysarpu.classification.Classifier.
- class LinearLogisticRegression[source]
Linear logistic regression model for classification.
- Parameters
params (numpy.array vector of size \(d_1+1\)) – current parameter vector.
- initialization(Xc, w=1.0)[source]
Initialization of the parameters of the model. Initial parameters are chosen randomly, and the dimension of the parameter vector is the dimension of the covariates + 1 (intercept).
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
None
- fit(Xc, gamma, w=1.0, warm_start=True)[source]
Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary classes; it can consist of probability values.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (False). Default: True.
- Returns
None
- class LinearDiscriminantClassifier[source]
Linear Discriminant Analysis model for classification.
- Parameters
params (dict) – current parameters: pi is the class prior, mu_0 the mean vector for negative class, mu_1 the mean vector for positive class, Sigma the covariance matrix.
- initialization(Xc, w=1.0)[source]
Initialization of the parameters of the model:
the class prior pi is randomly and uniformly drawn in \([0,1]\)
the mean vectors mu_0 and mu_1 are drawn from a standard Gaussian distribution
the covariance matrix Sigma is initialized as the empirical covariance matrix of the whole data set.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- fit(Xc, gamma, w=1.0, warm_start=True)[source]
Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary classes; it can consist of probability values.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (False). Default: True. Not important here as the maximization is straightforward and does not depend on the initialization.
- Returns
None
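The reason the maximization does not depend on the initialization is that, with soft labels gamma, the estimates of a shared-covariance Gaussian model are available in closed form. The sketch below shows the weighted estimates in one dimension, using the same parameter names as params. It is an illustration under that assumption only: the helper name is hypothetical, the weights w are ignored, and pysarpu's exact normalization is not documented on this page.

```python
def lda_m_step(x, gamma):
    """Closed-form update of a 1-D linear discriminant model from soft labels."""
    n = len(x)
    s1 = sum(gamma)            # effective number of positives
    s0 = n - s1                # effective number of negatives
    mu_1 = sum(g * xi for g, xi in zip(gamma, x)) / s1
    mu_0 = sum((1.0 - g) * xi for g, xi in zip(gamma, x)) / s0
    # pooled variance: each point contributes to both classes according to gamma
    sigma2 = sum(
        g * (xi - mu_1) ** 2 + (1.0 - g) * (xi - mu_0) ** 2
        for g, xi in zip(gamma, x)
    ) / n
    return {"pi": s1 / n, "mu_0": mu_0, "mu_1": mu_1, "Sigma": sigma2}

params = lda_m_step([0.0, 2.0, 4.0, 6.0], [0.0, 0.5, 0.5, 1.0])
```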
- eta(Xc)[source]
Class probability predictions given the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
class probabilities.
- Return type
numpy.array vector of size \(n\)
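For a discriminant analysis model, the class probability follows from Bayes' rule applied to the class prior and the two class-conditional densities (the quantities returned by pdf_pos and pdf_neg). The one-dimensional sketch below illustrates that relation; the helper names are hypothetical and the code is not taken from pysarpu.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian with mean mu and variance sigma2."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def lda_eta(x, pi, mu_0, mu_1, sigma2):
    """P(Z=1 | x) by Bayes' rule with a shared variance."""
    out = []
    for xi in x:
        pos = pi * normal_pdf(xi, mu_1, sigma2)          # prior * positive density
        neg = (1.0 - pi) * normal_pdf(xi, mu_0, sigma2)  # prior * negative density
        out.append(pos / (pos + neg))
    return out

eta = lda_eta([0.0, 1.0, 2.0], pi=0.5, mu_0=0.0, mu_1=2.0, sigma2=1.0)
```

With equal priors, the point halfway between the two means gets probability 0.5, and the two means get symmetric probabilities.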
- logeta(Xc)[source]
Class log-probability predictions given the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
class log-probabilities.
- Return type
numpy.array vector of size \(n\)
- pdf_pos(Xc)[source]
Individual likelihood for the positive distribution \(\mathbb{P}(x \vert Z=1)\) and for the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
individual likelihood values.
- Return type
numpy.array vector of size \(n\)
- pdf_neg(Xc)[source]
Individual likelihood for the negative distribution \(\mathbb{P}(x \vert Z=0)\) and for the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
individual likelihood values.
- Return type
numpy.array vector of size \(n\)
Propensity models
Three propensity models are provided in submodule pysarpu.propensity:
a Logistic Regression model
a logistic function with log-normal link function
a logistic function with Weibull link function