Welcome to pysarpu’s documentation!
pysarpu
PU learning under SAR assumption with unknown propensity. Implementation of the general SAR-EM algorithm.
Free software: MIT license
Documentation: https://pysarpu.readthedocs.io.
Features
TODO
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Installation
Stable release
To install pysarpu, run this command in your terminal:
$ pip install pysarpu
This is the preferred method to install pysarpu, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can guide you through the process.
From sources
The sources for pysarpu can be downloaded from the Github repo.
You can either clone the public repository:
$ git clone git://github.com/ocoudray/pysarpu
Or download the tarball:
$ curl -OJL https://github.com/ocoudray/pysarpu/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
Import
To use pysarpu in a project:
import pysarpu
PU learning classification model can be imported as follows:
from pysarpu import PU
The definition of a PU model requires the specification of a classification model and of a propensity model. Implementations can be found in sub-modules pysarpu.classification and pysarpu.propensity:
from pysarpu.classification import LinearLogisticRegression, LinearDiscriminantClassifier
from pysarpu.propensity import LogisticPropensity, LogProbitPropensity, GumbelPropensity
Modules
PU learning model
- class PUClassifier(cmodel, emodel, da=False)[source]
PU learning classification model under unknown propensity. This model works by specifying a model on the classification and on the propensity and estimates parameters using EM algorithm (SAR-EM, Bekker et al.)
- Parameters
cmodel (
pysarpu.classification.Classifier
) – an instance of class :class: Classifier <pysarpu.classification.Classifier> representing the classification model. This package includes two types of classification models: logistic regression (accessible throughpysarpu.classification.LinearLogisticRegression
) and linear discriminant analysis (accessible throughpysarpu.classification.LinearDiscriminantClassifier
)emodel (
pysarpu.propensity.Propensity
) – an instance of classpysarpu.propensity.Propensity
representing the propensity model. This package includes multiple pre-implemented propensity models: logistic propensity (pysarpu.propensity.LogisticPropensity
), log-normal propensity (pysarpu.propensity.LogProbitPropensity
) and Gumbel propensity (pysarpu.propensity.GumbelPropensity
)da (
bool
, optional) – whether the classification model is a discriminant analysis type model (True) or not (False). Indeed, the likelihood maximized is not the same in these two settings. Default: False.
- Returns
Return an instance of PU learning model (not yet initialized).
- Return type
- initialization(Xc, Xe, Y, w=1.0)[source]
Initialization of parameters for both classification and propensity models before running EM algorithm. The parameters of each models are initialized following their respective method: see initialization methods for cmodel and emodel.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification. The parameters of cmodel will be initialized in agreement with the dimension of the entry data \(d_1\).
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity. The parameters of emodel will be initialized in agreement with the dimension of the entry data \(d_2\).
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
- Returns
None
- e(Xe)[source]
Propensity function using the current parameters of propensity model emodel.
- Parameters
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
- Returns
vector of propensity scores.
- Return type
numpy.array of size \(n\)
- loge(Xe)[source]
Logarithm of propensity function using the current parameters of propensity model emodel.
- Parameters
Xe (numpy.array of shape \((n,d_2)\)) – covariate matrix for propensity.
- Returns
vector of log-propensity scores.
- Return type
numpy.array of size \(n\)
- predict_cproba(Xc)[source]
Class probability predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
- Returns
posterior class probabilities.
- Return type
numpy.array vector of size \(n\)
- predict_clogproba(Xc)[source]
Class log-probability predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
- Returns
posterior class log-probabilities.
- Return type
numpy.array vector of size \(n\)
- predict_c(Xc, threshold=0.5)[source]
Class binary predictions using the parameters of the classification model.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.
- Returns
class predictions.
- Return type
numpy.array binary vector of size \(n\)
- predict_proba(Xc, Xe)[source]
Label probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_cproba which returns class probabilities instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
- Returns
posterior label probabilities.
- Return type
numpy.array vector of size \(n\)
- predict_logproba(Xc, Xe)[source]
Label log-probability predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_clogproba which returns class log-probabilities instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
- Returns
posterior label log-probabilities.
- Return type
numpy.array vector of size \(n\)
- predict(Xc, Xe, threshold=0.5)[source]
Label binary predictions based on the classification model cmodel and the propensity model emodel. Note that this is different from method predict_c which returns class predictions instead.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
threshold (float, optional (in \([0,1]\))) – decision threshold defining the decision rule.
- Returns
label binary predictions.
- Return type
numpy.array binary vector of size \(n\)
- loglikelihood(Xc, Xe, Y, w=1.0)[source]
Log-likelihood function given the current parameters of classification and propensity models. Note that the funciton returns the mean of individual dlog-likelihoods (instead of the usual sum).
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
- Returns
log-likelihood.
- Return type
float
- expectation(Xc, Xe, Y)[source]
Compute the expectation step of EM algorithm, return the probabilities for every instance to be of positive class given the observed labels.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
- Returns
posterior probabilities
- Return type
np.array vector of size \(n\)
- maximisation(Xc, Xe, Y, gamma, w=1.0, warm_start=True, balance=False)[source]
Compute the maximisation step of EM algorithm, update the model parameters in both classification and propensity models.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
- Returns
None
- fit(Xc, Xe, Y, w=1.0, tol=1e-06, max_iter=10000.0, warm_start=False, balance=False, n_init=20, iter_init=20)[source]
Estimation of PU learning model parameters (classifier and propensity) through EM algorithm. Multiple random initialization are considered and trained over a few iterations. Then, only the one achieving the best log-likelihood is considered and trained until convergence.
- Parameters
Xc (numpy.array with shape \((n, d_1)\).) – covariate matrix for classification.
Xe (numpy.array with shape \((n, d_2)\).) – covariate matrix for propensity.
Y (numpy.array vector of size \(n\).) – observed labels. Only used in the computation of the initial log-likelihood.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
tol (float, optional) – tolerance parameter. Once the increase in the log-likelihood is below tol, the algorithm stops (default 1e-6).
max_iter (int, optional) – maximum number of iterations (default: 1e4)
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (default False).
balance (bool, optional) – re-balance weights when fitting the propensity model in the maximization (experimental, potentially interesting in highly unbalanced situations). Default: False.
n_init (int, optional) – number of initialization to consider in the Small EM initialization strategy (default: n_init=20)
iter_init (int, optional) – maximum number of iterations to consider for each initialization (default: 20).
- Returns
None
Classification models
Two classification models can be found in submodule pysarpu.classification:
a Linear Logistic Regression model
a Linear Discriminant Analysis model
These two models inherit from the general class sklearn.classification.Classifier.
- class LinearLogisticRegression[source]
Linear logistic regression model for classification.
- Parameters
params (numpy.array vector of size \(d_1+1\)) – current parameter vector.
- initialization(Xc, w=1.0)[source]
Initialization of the parameters of the model. Initial parameters are chosen randomly and the dimension of parameter vector is the dimension of the covariates + 1 (intercept).
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
None
- fit(Xc, gamma, w=1.0, warm_start=True)[source]
Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary classes, it can consist in probability values.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (default False).
- Returns
None
- class LinearDiscriminantClassifier[source]
Linear Discriminant Analysis model for classification.
- Parameters
params (dict) – current parameters: pi is the class prior, mu_0 the mean vector for negative class, mu_1 the mean vector for positive class, Sigma the covariance matrix.
- initialization(Xc, w=1.0)[source]
Initialization of the parameters of the model:
the class prior pi is randomly and uniformly drawn in \([0,1]\)
the mean vectors mu_0 and mu_1 are drawn as standardized gaussian variables
the covariance matrix Sigma is initialized as the empirical covariance matrix of the whole data set.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- fit(Xc, gamma, w=1.0, warm_start=True)[source]
Estimation of the parameters of the model given the covariates and the observed output. Note that the output does not need to be binary classes, it can consist in probability values.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
gamma (numpy.array of size \(n\)) – posterior probabilities obtained in the expectation step.
w (either float (1., default) or numpy.array of size \(n\), optional.) – individual weights (experimental, not tested). Apply weights to observations in the computation of the likelihood.
warm_start (bool, optional) – indicates whether current parameters can be used for initialization (True) or if they should be re-initialized before estimation (default False). Not important here as the maximization is straightforward and does not depend on the initialization.
- Returns
None
- eta(Xc)[source]
Class probability predictions given the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
class probabilities.
- Return type
numpy.array vector of size \(n\)
- logeta(Xc)[source]
Class log-probability predictions given the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
class log-probabilities.
- Return type
numpy.array vector of size \(n\)
- pdf_pos(Xc)[source]
Individual likelihood for the positive distribution \(\mathbb{P}(x \vert Z=1)\) and for the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
individual likelihood values.
- Return type
numpy.array vector of size \(n\)
- pdf_neg(Xc)[source]
Individual likelihood for the positive distribution \(\mathbb{P}(x \vert Z=0)\) and for the current parameters.
- Parameters
Xc (numpy.array of shape \((n,d_1)\)) – covariate matrix for classification.
- Returns
individual likelihood values.
- Return type
numpy.array vector of size \(n\)
Propensity models
Three propensity models are provided in submodule pysarpu.propensity:
a Logistic Regression model
a logistic function with log-normal link function
a logistic function with Weibull link function
Contributing
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions
Report Bugs
Report bugs at https://github.com/ocoudray/pysarpu/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation
pysarpu could always use more documentation, whether as part of the official pysarpu docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback
The best way to send feedback is to file an issue at https://github.com/ocoudray/pysarpu/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!
Ready to contribute? Here’s how to set up pysarpu for local development.
Fork the pysarpu repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your_name_here/pysarpu.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv pysarpu $ cd pysarpu/ $ python setup.py develop
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 pysarpu tests $ python setup.py test or pytest $ tox
To get flake8 and tox, just pip install them into your virtualenv.
Commit your changes and push your branch to GitHub:
$ git add . $ git commit -m "Your detailed description of your changes." $ git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines
Before you submit a pull request, check that it meets these guidelines:
The pull request should include tests.
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
The pull request should work for Python 3.5, 3.6, 3.7 and 3.8, and for PyPy. Check https://travis-ci.com/ocoudray/pysarpu/pull_requests and make sure that the tests pass for all supported Python versions.
Tips
To run a subset of tests:
$ python -m unittest tests.test_pysarpu
Deploying
A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:
$ bump2version patch # possible: major / minor / patch
$ git push
$ git push --tags
Travis will then deploy to PyPI if tests pass.
Credits
Development Lead
Olivier COUDRAY <olivier.coudray.15@polytechnique.org>
Contributors
None yet. Why not be the first?
History
0.1.0 (2022-10-07)
First release on PyPI.