rdkit.ML.Data.SplitData module

rdkit.ML.Data.SplitData.SplitDataSet(data, frac, silent=0)

splits a data set into two pieces

Arguments

data: a list of examples to be split

frac: the fraction of the data to be put in the first data set

silent: controls the amount of visual noise produced.

Returns

a 2-tuple containing the two new data sets.
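As an illustration of the idea only, the following sketch splits a list of examples by shuffling their indices and cutting the shuffled list at frac. It does not call RDKit; split_examples, its seed argument, and the round-down rule for the size of the first set are assumptions made for this example.

```python
import random


def split_examples(data, frac, seed=None):
    """Hypothetical helper: put roughly `frac` of `data` in the first list."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    # Assumption: the size of the first set is rounded down.
    n_first = int(frac * len(data))
    first = [data[i] for i in indices[:n_first]]
    second = [data[i] for i in indices[n_first:]]
    return first, second


# Ten toy examples of the form (name, activity).
examples = [("mol-%d" % i, i % 2) for i in range(10)]
first, second = split_examples(examples, 0.5, seed=42)
print(len(first), len(second))
```

Every example ends up in exactly one of the two returned lists, so the pair can be used directly as a training set and a hold-out set.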

rdkit.ML.Data.SplitData.SplitDbData(conn, fracs, table='', fields='*', where='', join='', labelCol='', useActs=0, nActs=2, actCol='', actBounds=[], silent=0)

“splits” a data set held in a DB by returning lists of ids

Arguments:

conn: a DbConnect object

fracs: the split fraction. This can optionally be specified as a sequence with a different fraction for each activity value.

table,fields,where,join: (optional) SQL query parameters

useActs: (optional) toggles splitting based on activities (ensuring that a given fraction of each activity class ends up in the hold-out set). Defaults to 0.

nActs: (optional) number of possible activity values; only used if _useActs_ is nonzero. Defaults to 2.

actCol: (optional) name of the activity column. Defaults to the last column returned by the query.

actBounds: (optional) sequence of activity bounds (for cases where the activity isn’t quantized in the db). Defaults to an empty sequence.

silent: controls the amount of visual noise produced.
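The activity-based splitting controlled by useActs can be pictured as an independent split within each activity class, with either one shared fraction or one fraction per class. The sketch below is a pure-Python illustration under that reading, not RDKit's implementation; stratified_split, its arguments, and the rounding rule are hypothetical.

```python
import random


def stratified_split(ids_by_class, fracs, seed=None):
    """Hypothetical helper: take fracs[c] of every activity class c.

    ids_by_class: one list of ids per activity class
    fracs: a single float, or a sequence with one float per class
    """
    rng = random.Random(seed)
    if isinstance(fracs, (int, float)):
        fracs = [fracs] * len(ids_by_class)
    if len(fracs) != len(ids_by_class):
        raise ValueError("need one fraction per activity class")
    first, second = [], []
    for class_ids, frac in zip(ids_by_class, fracs):
        shuffled = list(class_ids)
        rng.shuffle(shuffled)
        # Assumption: the per-class count is rounded to the nearest integer.
        n_first = int(round(frac * len(shuffled)))
        first.extend(shuffled[:n_first])
        second.extend(shuffled[n_first:])
    return first, second


# Toy data: odd ids are inactive (class 0), even ids are active (class 1).
inactives = ["id-1", "id-3", "id-5", "id-7", "id-9", "id-11"]
actives = ["id-2", "id-4", "id-6", "id-8", "id-10", "id-12"]
first, second = stratified_split([inactives, actives], [0.5, 1.0 / 3.0], seed=7)
print(first)
```

Because each class is processed in turn, the ids in the first list come out grouped by activity class: here three inactives followed by two actives.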

Usage:

Set up the db connection. The simple tables we’re using have actives with even ids and inactives with odd ids:

>>> from rdkit.ML.Data import DataUtils
>>> from rdkit.ML.Data.SplitData import SplitDbData
>>> from rdkit.Dbase.DbConnection import DbConnect
>>> from rdkit import RDConfig
>>> conn = DbConnect(RDConfig.RDTestDatabase)

Pull a set of points from a simple table… take 33% of all points:

>>> DataUtils.InitRandomNumbers((23,42))
>>> train,test = SplitDbData(conn,1./3.,'basic_2class')
>>> [str(x) for x in train]
['id-7', 'id-6', 'id-2', 'id-8']

…take 50% of actives and 50% of inactives:

>>> DataUtils.InitRandomNumbers((23,42))
>>> train,test = SplitDbData(conn,.5,'basic_2class',useActs=1)
>>> [str(x) for x in train]
['id-5', 'id-3', 'id-1', 'id-4', 'id-10', 'id-8']

Notice how the results came out sorted by activity.

We can be asymmetrical: take 33% of actives and 50% of inactives:

>>> DataUtils.InitRandomNumbers((23,42))
>>> train,test = SplitDbData(conn,[.5,1./3.],'basic_2class',useActs=1)
>>> [str(x) for x in train]
['id-5', 'id-3', 'id-1', 'id-4', 'id-10']

And we can pull from tables with non-quantized activities by providing activity quantization bounds:

>>> DataUtils.InitRandomNumbers((23,42))
>>> train,test = SplitDbData(conn,.5,'float_2class',useActs=1,actBounds=[1.0])
>>> [str(x) for x in train]
['id-5', 'id-3', 'id-1', 'id-4', 'id-10', 'id-8']
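To make the role of activity bounds concrete, here is a small sketch of how a sorted sequence of N bounds can map a floating-point activity onto one of N+1 integer classes. The helper quantize_activity is hypothetical, and the choice that a value exactly equal to a bound falls into the upper class is an assumption, not something this page specifies.

```python
from bisect import bisect_right


def quantize_activity(value, act_bounds):
    """Hypothetical helper: map a float activity to a class index.

    act_bounds must be sorted in ascending order. Assumption: a value
    equal to a bound is placed in the class above that bound.
    """
    return bisect_right(act_bounds, value)


# Two bounds define three classes: below 1.0, from 1.0 up to 2.0, and 2.0 or more.
classes = [quantize_activity(v, [1.0, 2.0]) for v in (0.2, 1.0, 1.5, 3.7)]
print(classes)  # prints [0, 1, 1, 2]
```

With a single bound such as [1.0], the same function yields a two-class labelling: 0 below the bound and 1 at or above it.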

rdkit.ML.Data.SplitData.SplitIndices(nPts, frac, silent=1, legacy=0, replacement=0)

splits the indices of a data set into two pieces

Arguments

nPts: the total number of points

frac: the fraction of the data to be put in the first data set

silent: (optional) toggles display of stats

legacy: (optional) use the legacy splitting approach

replacement: (optional) use selection with replacement

Returns

a 2-tuple containing the two sets of indices.

Notes

the _legacy_ splitting approach uses randomly-generated floats and compares them to _frac_. This is provided for backwards-compatibility reasons.

the default splitting approach uses a random permutation of indices which is split into two parts.

selection with replacement can generate duplicates.
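The three approaches described in these notes can be sketched in plain Python as follows. This illustrates the ideas only and is not RDKit's code; the function names, the rng argument, and the round-down rule for the size of the first set are assumptions.

```python
import random


def split_permutation(n_pts, frac, rng):
    """Default idea: shuffle the indices, then cut the permutation in two."""
    perm = list(range(n_pts))
    rng.shuffle(perm)
    n_first = int(n_pts * frac)
    return perm[:n_first], perm[n_first:]


def split_legacy(n_pts, frac, rng):
    """Legacy idea: draw one random float per index and compare it to frac."""
    first, second = [], []
    for i in range(n_pts):
        if rng.random() < frac:
            first.append(i)
        else:
            second.append(i)
    return first, second


def split_with_replacement(n_pts, frac, rng):
    """Replacement idea: draw indices with repeats; the rest form the second set."""
    n_first = int(n_pts * frac)
    first = [rng.randrange(n_pts) for _ in range(n_first)]
    chosen = set(first)
    second = [i for i in range(n_pts) if i not in chosen]
    return first, second


rng = random.Random(23)
for splitter in (split_permutation, split_legacy, split_with_replacement):
    first, second = splitter(10, 0.5, rng)
    print(splitter.__name__, len(first), len(second))
```

The permutation split always yields a first set of fixed size with no repeats, the float-comparison split yields sets whose sizes vary from call to call, and the replacement split yields a first set of fixed size that may repeat indices, leaving a second set of variable size.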

Usage:

We’ll start with a set of indices and pick from them using the three different approaches:

>>> from rdkit.ML.Data import DataUtils
>>> from rdkit.ML.Data.SplitData import SplitIndices

The base approach always returns the same number of compounds in each set and has no duplicates:

>>> DataUtils.InitRandomNumbers((23,42))
>>> test,train = SplitIndices(10,.5)
>>> test
[1, 5, 6, 4, 2]
>>> train
[3, 0, 7, 8, 9]

>>> test,train = SplitIndices(10,.5)
>>> test
[5, 2, 9, 8, 7]
>>> train
[6, 0, 3, 1, 4]

The legacy approach can return varying numbers, but still has no duplicates. Note the indices come back ordered:

>>> DataUtils.InitRandomNumbers((23,42))
>>> test,train = SplitIndices(10,.5,legacy=1)
>>> test
[3, 5, 7, 8, 9]
>>> train
[0, 1, 2, 4, 6]

>>> test,train = SplitIndices(10,.5,legacy=1)
>>> test
[0, 1, 2, 3, 5, 8, 9]
>>> train
[4, 6, 7]

The replacement approach returns a fixed number in the first set and a variable number in the second set, and the first set can contain duplicates (in these examples the first set is bound to the name test):

>>> DataUtils.InitRandomNumbers((23,42))
>>> test,train = SplitIndices(10,.5,replacement=1)
>>> test
[9, 9, 8, 0, 5]
>>> train
[1, 2, 3, 4, 6, 7]

>>> test,train = SplitIndices(10,.5,replacement=1)
>>> test
[4, 5, 1, 1, 4]
>>> train
[0, 2, 3, 6, 7, 8, 9]
