rdkit.ML.Data.SplitData module

rdkit.ML.Data.SplitData.SplitDataSet(data, frac, silent=0)

splits a data set into two pieces

Arguments

  • data: a list of examples to be split

  • frac: the fraction of the data to be put in the first data set

  • silent: controls the amount of visual noise produced.

Returns

a 2-tuple containing the two new data sets.

rdkit.ML.Data.SplitData.SplitDbData(conn, fracs, table='', fields='*', where='', join='', labelCol='', useActs=0, nActs=2, actCol='', actBounds=[], silent=0)

“splits” a data set held in a DB by returning lists of ids

Arguments:

  • conn: a DbConnect object

  • frac: the split fraction. This can optionally be specified as a sequence with a different fraction for each activity value.

  • table,fields,where,join: (optional) SQL query parameters

  • useActs: (optional) toggles splitting based on activities (ensuring that a given fraction of each activity class ends up in the hold-out set) Defaults to 0

  • nActs: (optional) number of possible activity values, only used if _useActs_ is nonzero Defaults to 2

  • actCol: (optional) name of the activity column Defaults to use the last column returned by the query

  • actBounds: (optional) sequence of activity bounds (for cases where the activity isn’t quantized in the db) Defaults to an empty sequence

  • silent: controls the amount of visual noise produced.

Usage:

Set up the db connection, the simple tables we’re using have actives with even ids and inactives with odd ids: >>> from rdkit.ML.Data import DataUtils >>> from rdkit.Dbase.DbConnection import DbConnect >>> from rdkit import RDConfig >>> conn = DbConnect(RDConfig.RDTestDatabase)

Pull a set of points from a simple table… take 33% of all points: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,1./3.,’basic_2class’) >>> [str(x) for x in train] [‘id-7’, ‘id-6’, ‘id-2’, ‘id-8’]

…take 50% of actives and 50% of inactives: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,.5,’basic_2class’,useActs=1) >>> [str(x) for x in train] [‘id-5’, ‘id-3’, ‘id-1’, ‘id-4’, ‘id-10’, ‘id-8’]

Notice how the results came out sorted by activity

We can be asymmetrical: take 33% of actives and 50% of inactives: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,[.5,1./3.],’basic_2class’,useActs=1) >>> [str(x) for x in train] [‘id-5’, ‘id-3’, ‘id-1’, ‘id-4’, ‘id-10’]

And we can pull from tables with non-quantized activities by providing activity quantization bounds: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,.5,’float_2class’,useActs=1,actBounds=[1.0]) >>> [str(x) for x in train] [‘id-5’, ‘id-3’, ‘id-1’, ‘id-4’, ‘id-10’, ‘id-8’]

rdkit.ML.Data.SplitData.SplitIndices(nPts, frac, silent=1, legacy=0, replacement=0)

splits a set of indices into a data set into 2 pieces

Arguments

  • nPts: the total number of points

  • frac: the fraction of the data to be put in the first data set

  • silent: (optional) toggles display of stats

  • legacy: (optional) use the legacy splitting approach

  • replacement: (optional) use selection with replacement

Returns

a 2-tuple containing the two sets of indices.

Notes

  • the _legacy_ splitting approach uses randomly-generated floats and compares them to _frac_. This is provided for backwards-compatibility reasons.

  • the default splitting approach uses a random permutation of indices which is split into two parts.

  • selection with replacement can generate duplicates.

Usage:

We’ll start with a set of indices and pick from them using the three different approaches: >>> from rdkit.ML.Data import DataUtils

The base approach always returns the same number of compounds in each set and has no duplicates: >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5) >>> test [1, 5, 6, 4, 2] >>> train [3, 0, 7, 8, 9]

>>> test,train = SplitIndices(10,.5)
>>> test
[5, 2, 9, 8, 7]
>>> train
[6, 0, 3, 1, 4]

The legacy approach can return varying numbers, but still has no duplicates. Note the indices come back ordered: >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5,legacy=1) >>> test [3, 5, 7, 8, 9] >>> train [0, 1, 2, 4, 6]

>>> test,train = SplitIndices(10,.5,legacy=1)
>>> test
[0, 1, 2, 3, 5, 8, 9]
>>> train
[4, 6, 7]

The replacement approach returns a fixed number in the training set, a variable number in the test set and can contain duplicates in the training set. >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5,replacement=1) >>> test [9, 9, 8, 0, 5] >>> train [1, 2, 3, 4, 6, 7] >>> test,train = SplitIndices(10,.5,replacement=1) >>> test [4, 5, 1, 1, 4] >>> train [0, 2, 3, 6, 7, 8, 9]