rdkit.ML.InfoTheory.rdInfoTheory module

Module containing a collection of functions for information metrics and a ranker for fingerprint bits

class rdkit.ML.InfoTheory.rdInfoTheory.BitCorrMatGenerator((object)arg1)

Bases: instance

A class to generate a pairwise correlation matrix for a list of bits. The typical mode of operation is:

>>> cmg = BitCorrMatGenerator() 
>>> cmg.SetBitList(blist) 
>>> for fp in fpList:
...     cmg.CollectVotes(fp)
>>> corrMat = cmg.GetCorrMatrix() 

The resulting correlation matrix is a one-dimensional numeric array containing the lower-triangle elements

C++ signature :

void __init__(_object*)
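
The snippet above uses placeholder names (blist, fpList). A more self-contained sketch of the same workflow, using small ExplicitBitVects as stand-in fingerprints (the bit IDs and data here are purely illustrative):

>>> from rdkit import DataStructs
>>> from rdkit.ML.InfoTheory import rdInfoTheory
>>> fpList = []
>>> for onBits in ((0, 2, 5), (0, 2), (2, 5)):
...     bv = DataStructs.ExplicitBitVect(8)
...     for b in onBits:
...         bv.SetBit(b)
...     fpList.append(bv)
...
>>> cmg = rdInfoTheory.BitCorrMatGenerator()
>>> cmg.SetBitList([0, 2, 5])
>>> for fp in fpList:
...     cmg.CollectVotes(fp)
...
>>> corrMat = cmg.GetCorrMatrix()

If the lower triangle is packed row-major without the diagonal (an assumption about the storage order), a bit list of length n yields n*(n-1)/2 entries and the pair (i, j) with i > j sits at index i*(i-1)/2 + j.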

CollectVotes((BitCorrMatGenerator)self, (AtomPairsParameters)bitVect) None :

For each pair of on bits (bi, bj) in fp, increase the correlation count for the pair by 1

ARGUMENTS:

  • fp : the bit vector (fingerprint) to collect votes from

C++ signature :

void CollectVotes(RDInfoTheory::BitCorrMatGenerator*,boost::python::api::object)

GetCorrMatrix((BitCorrMatGenerator)self) object :

Get the correlation matrix after votes have been collected from a set of fingerprints

C++ signature :

_object* GetCorrMatrix(RDInfoTheory::BitCorrMatGenerator*)

SetBitList((BitCorrMatGenerator)self, (AtomPairsParameters)bitList) None :

Set the list of bits that need to be correlated

These may, for example, be the top-ranking ensemble bits

ARGUMENTS:

  • bitList : an integer list of bit IDs

C++ signature :

void SetBitList(RDInfoTheory::BitCorrMatGenerator*,boost::python::api::object)

rdkit.ML.InfoTheory.rdInfoTheory.ChiSquare((AtomPairsParameters)resArr) float :

Calculates the chi-squared value for a variable

ARGUMENTS:

  • varMat: a Numeric array with the number of occurrences of each result for each possible value of the given variable. So, for a variable that adopts 4 possible values and a result that has 3 possible values, varMat would be 4x3.

RETURNS:

  • a Python float object

C++ signature :

double ChiSquare(boost::python::api::object)
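
A hedged illustration (assuming an integer numpy array is an acceptable varMat here): the 2x2 table below reproduces bit 3 from the CHISQUARE InfoBitRanker example further down, which should give the same 4.000 reported there, assuming the ranker uses this same function.

>>> import numpy as np
>>> from rdkit.ML.InfoTheory import rdInfoTheory
>>> varMat = np.array([[2, 0],
...                    [0, 2]])   # rows: bit on/off, columns: class 0/1
>>> print('%.3f' % rdInfoTheory.ChiSquare(varMat))
4.000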

class rdkit.ML.InfoTheory.rdInfoTheory.InfoBitRanker((object)self, (int)nBits, (int)nClasses)

Bases: instance

A class to rank the bits from a series of labelled fingerprints. A simple demonstration may help clarify what this class does. Here’s a small set of vectors:

>>> for i,bv in enumerate(bvs): print(bv.ToBitString(),acts[i])
... 
0001 0
0101 0
0010 1
1110 1

Default ranker, using infogain:

>>> ranker = InfoBitRanker(4,2)  
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
... 
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
... 
3 1.000 2 0
2 1.000 0 2
0 0.311 0 1

Using the biased infogain:

>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.BIASENTROPY)
>>> ranker.SetBiasList((1,))
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
... 
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
... 
2 1.000 0 2
0 0.311 0 1
1 0.000 1 1

A chi squared ranker is also available:

>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.CHISQUARE)
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
... 
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
... 
3 4.000 2 0
2 4.000 0 2
0 1.333 0 1

As is a biased chi squared:

>>> ranker = InfoBitRanker(4,2,InfoTheory.InfoType.BIASCHISQUARE)
>>> ranker.SetBiasList((1,))
>>> for i,bv in enumerate(bvs): ranker.AccumulateVotes(bv,acts[i])
... 
>>> for bit,gain,n0,n1 in ranker.GetTopN(3): print(int(bit),'%.3f'%gain,int(n0),int(n1))
... 
2 4.000 0 2
0 1.333 0 1
1 0.000 1 1

C++ signature :

void __init__(_object*,int,int)

__init__( (object)self, (int)nBits, (int)nClasses, (InfoType)infoType) -> None :

C++ signature :

void __init__(_object*,int,int,RDInfoTheory::InfoBitRanker::InfoType)
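
The bit vectors bvs and activity labels acts used in the examples above are not defined there. A minimal sketch of how they could be built (the import path is an assumption; the InfoTheory.InfoType used in the examples corresponds to the InfoType enum exposed by this module):

>>> from rdkit import DataStructs
>>> from rdkit.ML.InfoTheory.rdInfoTheory import InfoBitRanker, InfoType
>>> acts = [0, 0, 1, 1]
>>> bvs = []
>>> for s in ('0001', '0101', '0010', '1110'):
...     bv = DataStructs.ExplicitBitVect(4)
...     for i, c in enumerate(s):
...         if c == '1':
...             bv.SetBit(i)
...     bvs.append(bv)
...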

AccumulateVotes((InfoBitRanker)self, (AtomPairsParameters)bitVect, (int)label) None :

Accumulate the votes for all the bits turned on in a bit vector

ARGUMENTS:

  • bv : a bit vector, either an ExplicitBitVect or a SparseBitVect

  • label : the class label for the bit vector. It is assumed that 0 <= label < nClasses

C++ signature :

void AccumulateVotes(RDInfoTheory::InfoBitRanker*,boost::python::api::object,int)

GetTopN((InfoBitRanker)self, (int)num) object :

Returns the top n bits ranked by the information metric. This is the function where most of the ranking work actually happens

ARGUMENTS:

  • num : the number of top ranked bits that are required

C++ signature :

_object* GetTopN(RDInfoTheory::InfoBitRanker*,int)

SetBiasList((InfoBitRanker)self, (AtomPairsParameters)classList) None :

Set the classes to which the entropy calculation should be biased

This list contains the class ids used in the BIASENTROPY (or BIASCHISQUARE) mode of ranking bits. In these modes, a bit must be more highly correlated with one of the biased classes than with all of the other classes. For example, in a two-class problem with actives and inactives, the fraction of actives that hit the bit has to be greater than the fraction of inactives that hit the bit

ARGUMENTS:

  • classList : list of class ids that we want a bias towards

C++ signature :

void SetBiasList(RDInfoTheory::InfoBitRanker*,boost::python::api::object)

SetMaskBits((InfoBitRanker)self, (AtomPairsParameters)maskBits) None :

Set the mask bits for the calculation

ARGUMENTS:

  • maskBits : list of mask bits to use

C++ signature :

void SetMaskBits(RDInfoTheory::InfoBitRanker*,boost::python::api::object)
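
A brief usage sketch; the interpretation that the mask restricts which bits are considered during ranking is an assumption based on the name:

>>> from rdkit.ML.InfoTheory import rdInfoTheory
>>> ranker = rdInfoTheory.InfoBitRanker(1024, 2)
>>> ranker.SetMaskBits([10, 27, 300])   # assumption: only these bits compete in GetTopN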

Tester((InfoBitRanker)self, (AtomPairsParameters)bitVect) None :
C++ signature :

void Tester(RDInfoTheory::InfoBitRanker*,boost::python::api::object)

WriteTopBitsToFile((InfoBitRanker)self, (str)fileName) None :

Write the bits that have been ranked to a file

C++ signature :

void WriteTopBitsToFile(RDInfoTheory::InfoBitRanker {lvalue},std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)

rdkit.ML.InfoTheory.rdInfoTheory.InfoEntropy((AtomPairsParameters)resArr) float :

Calculates the informational entropy of the values in an array

ARGUMENTS:

  • resArr: a numeric array containing the counts of each possible outcome

RETURNS:

  • a Python float object

C++ signature :

double InfoEntropy(boost::python::api::object)
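
A small hedged example (assuming an integer numpy array of counts is accepted): a 50/50 split over two outcomes gives one bit of entropy.

>>> import numpy as np
>>> from rdkit.ML.InfoTheory import rdInfoTheory
>>> counts = np.array([5, 5])            # counts of each outcome
>>> print('%.3f' % rdInfoTheory.InfoEntropy(counts))
1.000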

rdkit.ML.InfoTheory.rdInfoTheory.InfoGain((AtomPairsParameters)resArr) float :

Calculates the information gain for a variable

ARGUMENTS:

  • varMat: a Numeric array with the number of occurrences of each result for each possible value of the given variable. So, for a variable that adopts 4 possible values and a result that has 3 possible values, varMat would be 4x3.

RETURNS:

  • a Python float object

NOTES

  • this is a drop-in replacement for _PyInfoGain()_ in entropy.py

C++ signature :

double InfoGain(boost::python::api::object)
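
A hedged example (again assuming an integer numpy array is accepted): these counts reproduce bit 3 from the default InfoBitRanker example above, which perfectly separates the two classes and is reported with a gain of 1.000.

>>> import numpy as np
>>> from rdkit.ML.InfoTheory import rdInfoTheory
>>> varMat = np.array([[2, 0],
...                    [0, 2]])   # rows: bit on/off, columns: class 0/1
>>> print('%.3f' % rdInfoTheory.InfoGain(varMat))
1.000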

class rdkit.ML.InfoTheory.rdInfoTheory.InfoType

Bases: enum

BIASCHISQUARE = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE
BIASENTROPY = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY
CHISQUARE = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE
ENTROPY = rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY
names = {'BIASCHISQUARE': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE, 'BIASENTROPY': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY, 'CHISQUARE': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE, 'ENTROPY': rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY}
values = {1: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.ENTROPY, 2: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASENTROPY, 3: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.CHISQUARE, 4: rdkit.ML.InfoTheory.rdInfoTheory.InfoType.BIASCHISQUARE}