
Source Code for Module ML.BuildComposite

# $Id: BuildComposite.py 742 2008-07-05 07:42:38Z glandrum $
#
#  Copyright (C) 2000-2008  greg Landrum and Rational Discovery LLC
#
#   @@ All Rights Reserved  @@
#
""" command line utility for building composite models

#DOC

**Usage**

  BuildComposite [optional args] filename

Unless indicated otherwise (via command line arguments), _filename_ is
a QDAT file.

**Command Line Arguments**

  - -o *filename*: name of the output file for the pickled composite

  - -n *num*: number of separate models to add to the composite

  - -p *tablename*: store persistence data in the database
     in table *tablename*

  - -N *note*: attach some arbitrary text to the persistence data

  - -b *filename*: name of the text file to hold examples from the
     holdout set which are misclassified

  - -s: split the data into training and hold-out sets before building
     the composite

  - -f *frac*: the fraction of data to use in the training set when the
     data is split

  - -r: randomize the activities (for testing purposes).  This ignores
     the initial distribution of activity values and produces each
     possible activity value with equal likelihood.

  - -S: shuffle the activities (for testing purposes).  This produces
     a permutation of the input activity values.

  - -l: locks the random number generator to give consistent sets
     of training and hold-out data.  This is primarily intended
     for testing purposes.

  - -B: use a so-called Bayesian composite model.

  - -d *database name*: instead of reading the data from a QDAT file,
     pull it from a database.  In this case, the _filename_ argument
     provides the name of the database table containing the data set.

  - -D: show a detailed breakdown of the composite model performance
     across the training and, when appropriate, hold-out sets.

  - -P *pickle file name*: write out the pickled data set to the file

  - -F *filter frac*: filters the data before training to change the
     distribution of activity values in the training set.  *filter
     frac* is the fraction of the training set that should have the
     target value.  **See note below on data filtering.**

  - -v *filter value*: filters the data before training to change the
     distribution of activity values in the training set. *filter
     value* is the target value to use in filtering.  **See note below
     on data filtering.**

  - --modelFiltFrac *model filter frac*: Similar to filter frac above,
     in this case the data is filtered for each model in the composite
     rather than a single overall filter for a composite. *model
     filter frac* is the fraction of the training set for each model
     that should have the target value (*model filter value*).

  - --modelFiltVal *model filter value*: target value to use for
     filtering data before training each model in the composite.

  - -t *threshold value*: use high-confidence predictions for the
     final analysis of the hold-out data.

  - -Q *list string*: the values of quantization bounds for the
     activity value.  See the _-q_ argument for the format of *list
     string*.

  - --nRuns *count*: build *count* composite models

  - --prune: prune any models built

  - -h: print a usage message and exit.

  - -V: print the version number and exit

  *-*-*-*-*-*-*-*- Tree-Related Options -*-*-*-*-*-*-*-*

  - -g: be less greedy when training the models.

  - -G *number*: force trees to be rooted at descriptor *number*.

  - -L *limit*: provide an (integer) limit on individual model
     complexity

  - -q *list string*: Add QuantTrees to the composite and use the list
     specified in *list string* as the number of target quantization
     bounds for each descriptor.  Don't forget to include 0's at the
     beginning and end of *list string* for the name and value fields.
     For example, if there are 4 descriptors and you want 2 quant
     bounds apiece, you would use _-q "[0,2,2,2,2,0]"_.
     Two special cases:
       1) If you would like to ignore a descriptor in the model
          building, use '-1' for its number of quant bounds.
       2) If you have integer valued data that should not be quantized
          further, enter 0 for that descriptor.

  - --recycle: allow descriptors to be used more than once in a tree

  - --randomDescriptors=val: toggles growing random forests with val
      randomly-selected descriptors available at each node.


  *-*-*-*-*-*-*-*- KNN-Related Options -*-*-*-*-*-*-*-*

  - --doKnn: use K-Nearest Neighbors models

  - --knnK=*value*: the value of K to use in the KNN models

  - --knnTanimoto: use the Tanimoto metric in KNN models

  - --knnEuclid: use a Euclidean metric in KNN models

  *-*-*-*-*-*-*- Naive Bayes Classifier Options -*-*-*-*-*-*-*-*
  - --doNaiveBayes : use Naive Bayes classifiers

  - --mEstimateVal : the value to be used in the m-estimate formula
      If this is greater than 0.0, we use it to compute the conditional
      probabilities by the m-estimate

  *-*-*-*-*-*-*-*- SVM-Related Options -*-*-*-*-*-*-*-*

  **** NOTE: THESE ARE DISABLED ****

##   - --doSVM: use Support-vector machines

##   - --svmKernel=*kernel*: choose the type of kernel to be used for
##     the SVMs.  Options are:
##     The default is:

##   - --svmType=*type*: choose the type of support-vector machine
##     to be used.  Options are:
##     The default is:

##   - --svmGamma=*gamma*: provide the gamma value for the SVMs.  If this
##     is not provided, a grid search will be carried out to determine an
##     optimal *gamma* value for each SVM.

##   - --svmCost=*cost*: provide the cost value for the SVMs.  If this is
##     not provided, a grid search will be carried out to determine an
##     optimal *cost* value for each SVM.

##   - --svmWeights=*weights*: provide the weight values for the
##     activities.  If provided this should be a sequence of (label,
##     weight) 2-tuples *nActs* long.  If not provided, a weight of 1
##     will be used for each activity.

##   - --svmEps=*epsilon*: provide the epsilon value used to determine
##     when the SVM has converged.  Defaults to 0.001

##   - --svmDegree=*degree*: provide the degree of the kernel (when
##     sensible) Defaults to 3

##   - --svmCoeff=*coeff*: provide the coefficient for the kernel (when
##     sensible) Defaults to 0

##   - --svmNu=*nu*: provide the nu value for the kernel (when sensible)
##     Defaults to 0.5

##   - --svmDataType=*float*: if the data contains only 1s and 0s, specify
##     this by using binary. Defaults to float

##   - --svmCache=*cache*: provide the size of the memory cache (in MB)
##     to be used while building the SVM.  Defaults to 40

**Notes**

  - *Data filtering*: When there is a large disparity between the
    numbers of points with various activity levels present in the
    training set it is sometimes desirable to train on a more
    homogeneous data set.  This can be accomplished using filtering.
    The filtering process works by selecting a particular target
    fraction and target value.  For example, in a case where 95% of
    the original training set has activity 0 and only 5% activity 1, we
    could filter (by randomly removing points with activity 0) so that
    30% of the data set used to build the composite has activity 1.


"""
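The data-filtering note in the docstring above can be made concrete with a small sketch. This is not the module's `DataUtils.FilterData` (whose implementation is not shown here); `filter_to_fraction` and its parameters are hypothetical names, and the code only reproduces the arithmetic of the 95%/5% example under the assumption that non-target points are removed at random.

```python
import random


def filter_to_fraction(examples, target_val, target_frac, seed=0):
    """Randomly drop non-target points until target_val makes up target_frac.

    Hypothetical helper for illustration; each example is a list whose
    last element is the activity value.
    """
    rng = random.Random(seed)
    targets = [ex for ex in examples if ex[-1] == target_val]
    others = [ex for ex in examples if ex[-1] != target_val]
    # Keep n_keep non-target points so that
    # len(targets) / (len(targets) + n_keep) is close to target_frac.
    n_keep = int(round(len(targets) * (1.0 - target_frac) / target_frac))
    n_keep = min(n_keep, len(others))
    return targets + rng.sample(others, n_keep)


# 95% of the points have activity 0, 5% have activity 1.
examples = [[i, 0] for i in range(950)] + [[i, 1] for i in range(50)]
filtered = filter_to_fraction(examples, target_val=1, target_frac=0.30)
n_active = sum(1 for ex in filtered if ex[-1] == 1)
frac_active = n_active / float(len(filtered))
```

All 50 active points are kept and roughly 117 inactive ones survive, so about 30% of the filtered set has activity 1. In this sketch the removed points are simply discarded.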
import RDConfig
from utils import listutils
from ML.Composite import Composite,BayesComposite
#from ML.SVM import SVMClassificationModel as SVM
import numpy
import math
from ML.Data import DataUtils,SplitData
from ML import ScreenComposite
from Dbase import DbModule
from Dbase.DbConnection import DbConnect
from ML import CompositeRun
import sys,cPickle,time
import DataStructs

_runDetails = CompositeRun.CompositeRun()

__VERSION_STRING="3.2.3"

_verbose = 1
def message(msg):
  """ emits messages to _sys.stdout_
    override this in modules which import this one to redirect output

    **Arguments**

      - msg: the string to be displayed

  """
  if _verbose: sys.stdout.write('%s\n'%(msg))

def testall(composite,examples,badExamples=[]):
  """ screens a number of examples past a composite

    **Arguments**

      - composite: a composite model

      - examples: a list of examples (with results) to be screened

      - badExamples: a list to which misclassified examples are appended

    **Returns**

      a list of 2-tuples containing:

        1) a vote

        2) a confidence

      these are the votes and confidence levels for **misclassified** examples

  """
  wrong = []
  for example in examples:
    if composite.GetActivityQuantBounds():
      answer = composite.QuantizeActivity(example)[-1]
    else:
      answer = example[-1]
    res,conf = composite.ClassifyExample(example)
    if res != answer:
      wrong.append((res,conf))
      badExamples.append(example)

  return wrong

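To see what the screening loop in `testall()` returns, here is a minimal sketch against a stand-in model. `StubComposite` and `screen` are hypothetical names introduced for illustration; the stub implements only the two methods the loop touches, and `screen` mirrors the loop without on-the-fly activity quantization. It returns fresh lists rather than appending to a caller-supplied one, which also sidesteps the shared mutable default of `badExamples=[]`.

```python
class StubComposite(object):
    """Hypothetical stand-in for a composite model.

    Votes 1 when the first descriptor exceeds 0.5, always with confidence 1.0.
    """

    def GetActivityQuantBounds(self):
        return []

    def ClassifyExample(self, example):
        vote = 1 if example[1] > 0.5 else 0
        return vote, 1.0


def screen(composite, examples):
    """Collect (vote, confidence) pairs and examples for misclassified points."""
    wrong, bad = [], []
    for example in examples:
        res, conf = composite.ClassifyExample(example)
        if res != example[-1]:
            wrong.append((res, conf))
            bad.append(example)
    return wrong, bad


# Each example is [name, descriptor, activity].
examples = [['a', 0.9, 1], ['b', 0.2, 1], ['c', 0.7, 0], ['d', 0.1, 0]]
wrong, bad = screen(StubComposite(), examples)
```

Examples `b` and `c` are misclassified by the stub, so `wrong` holds their votes and confidences and `bad` holds the examples themselves, in the same order.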
def GetCommandLine(details):
  """ #DOC

  """
  args = ['BuildComposite']
  args.append('-n %d'%(details.nModels))
  if details.filterFrac != 0.0: args.append('-F %.3f -v %d'%(details.filterFrac,details.filterVal))
  if details.modelFilterFrac != 0.0: args.append('--modelFiltFrac=%.3f --modelFiltVal=%d'%(details.modelFilterFrac,
                                                                                        details.modelFilterVal))
  if details.splitRun: args.append('-s -f %.3f'%(details.splitFrac))
  if details.shuffleActivities: args.append('-S')
  if details.randomActivities: args.append('-r')
  if details.threshold > 0.0: args.append('-t %.3f'%(details.threshold))
  if details.activityBounds: args.append('-Q "%s"'%(details.activityBoundsVals))
  if details.dbName: args.append('-d %s'%(details.dbName))
  if details.detailedRes: args.append('-D')
  if hasattr(details,'noScreen') and details.noScreen: args.append('--noScreen')
  if details.persistTblName and details.dbName:
    args.append('-p %s'%(details.persistTblName))
  if details.note:
    args.append('-N %s'%(details.note))
  if details.useTrees:
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.lessGreedy: args.append('-g')
    if details.qBounds:
      shortBounds = listutils.CompactListRepr(details.qBounds)
      if details.qBounds: args.append('-q "%s"'%(shortBounds))
    else:
      if details.qBounds: args.append('-q "%s"'%(details.qBoundCount))

    if details.pruneIt: args.append('--prune')
    if details.startAt: args.append('-G %d'%details.startAt)
    if details.recycleVars: args.append('--recycle')
    if details.randomDescriptors: args.append('--randomDescriptors=%d'%details.randomDescriptors)
  if details.useSigTrees:
    args.append('--doSigTree')
    if details.limitDepth>0: args.append('-L %d'%(details.limitDepth))
    if details.randomDescriptors:
      args.append('--randomDescriptors=%d'%details.randomDescriptors)

  if details.useKNN:
    args.append('--doKnn --knnK %d'%(details.knnNeighs))
    if details.knnDistFunc=='Tanimoto':
      args.append('--knnTanimoto')
    else:
      args.append('--knnEuclid')

  if details.useNaiveBayes:
    args.append('--doNaiveBayes')
    if details.mEstimateVal >= 0.0 :
      args.append('--mEstimateVal=%.3f'%details.mEstimateVal)

##   if details.useSVM:
##     args.append('--doSVM')
##     if details.svmKernel:
##       for k in SVM.kernels.keys():
##         if SVM.kernels[k]==details.svmKernel:
##           args.append('--svmKernel=%s'%k)
##           break
##     if details.svmType:
##       for k in SVM.machineTypes.keys():
##         if SVM.machineTypes[k]==details.svmType:
##           args.append('--svmType=%s'%k)
##           break
##     if details.svmGamma:
##       args.append('--svmGamma=%f'%details.svmGamma)
##     if details.svmCost:
##       args.append('--svmCost=%f'%details.svmCost)
##     if details.svmWeights:
##       args.append("--svmWeights='%s'"%str(details.svmWeights))
##     if details.svmDegree:
##       args.append('--svmDegree=%d'%details.svmDegree)
##     if details.svmCoeff:
##       args.append('--svmCoeff=%d'%details.svmCoeff)
##     if details.svmEps:
##       args.append('--svmEps=%f'%details.svmEps)
##     if details.svmNu:
##       args.append('--svmNu=%f'%details.svmNu)
##     if details.svmCache:
##       args.append('--svmCache=%d'%details.svmCache)
##     if detail.svmDataType:
##       args.append('--svmDataType=%s'%details.svmDataType)
##     if not details.svmShrink:
##       args.append('--svmShrink')

  if details.replacementSelection: args.append('--replacementSelection')


  # this should always be last:
  if details.tableName: args.append(details.tableName)

  return ' '.join(args)

def RunOnData(details,data,progressCallback=None,saveIt=1,setDescNames=0):
  nExamples = data.GetNPts()
  if details.lockRandom:
    seed = details.randomSeed
  else:
    import random
    seed = (random.randint(0,1e6),random.randint(0,1e6))
  DataUtils.InitRandomNumbers(seed)
  testExamples = []
  if details.shuffleActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=1,runDetails=details)
  elif details.randomActivities == 1:
    DataUtils.RandomizeActivities(data,shuffle=0,runDetails=details)

  namedExamples = data.GetNamedData()
  if details.splitRun == 1:
    trainIdx,testIdx = SplitData.SplitIndices(len(namedExamples),details.splitFrac,
                                              silent=not _verbose)

    trainExamples = [namedExamples[x] for x in trainIdx]
    testExamples = [namedExamples[x] for x in testIdx]
  else:
    testExamples = []
    testIdx = []
    trainIdx = range(len(namedExamples))
    trainExamples = namedExamples

  if details.filterFrac != 0.0:
    # if we're doing quantization on the fly, we need to handle that here:
    if hasattr(details,'activityBounds') and details.activityBounds:
      tExamples = []
      bounds = details.activityBounds
      for pt in trainExamples:
        pt = pt[:]
        act = pt[-1]
        placed=0
        bound=0
        while not placed and bound < len(bounds):
          if act < bounds[bound]:
            pt[-1] = bound
            placed = 1
          else:
            bound += 1
        if not placed:
          pt[-1] = bound
        tExamples.append(pt)
    else:
      bounds = None
      tExamples = trainExamples
    trainIdx,temp = DataUtils.FilterData(tExamples,details.filterVal,
                                         details.filterFrac,-1,
                                         indicesOnly=1)
    tmp = [trainExamples[x] for x in trainIdx]
    testExamples += [trainExamples[x] for x in temp]
    trainExamples = tmp

    counts = DataUtils.CountResults(trainExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in training set:')
    for k in ks:
      message(str((k, counts[k])))
    counts = DataUtils.CountResults(testExamples,bounds=bounds)
    ks = counts.keys()
    ks.sort()
    message('Result Counts in test set:')
    for k in ks:
      message(str((k, counts[k])))
  nExamples = len(trainExamples)
  message('Training with %d examples'%(nExamples))

  nVars = data.GetNVars()
  attrs = range(1,nVars+1)
  nPossibleVals = data.GetNPossibleVals()
  for i in range(1,len(nPossibleVals)):
    if nPossibleVals[i-1] == -1:
      attrs.remove(i)

  if details.pickleDataFileName != '':
    pickleDataFile = open(details.pickleDataFileName,'wb+')
    cPickle.dump(trainExamples,pickleDataFile)
    cPickle.dump(testExamples,pickleDataFile)
    pickleDataFile.close()

  if details.bayesModel:
    composite = BayesComposite.BayesComposite()
  else:
    composite = Composite.Composite()

  composite._randomSeed = seed
  composite._splitFrac = details.splitFrac
  composite._shuffleActivities = details.shuffleActivities
  composite._randomizeActivities = details.randomActivities

  if hasattr(details,'filterFrac'):
    composite._filterFrac = details.filterFrac
  if hasattr(details,'filterVal'):
    composite._filterVal = details.filterVal

  composite.SetModelFilterData(details.modelFilterFrac, details.modelFilterVal)

  composite.SetActivityQuantBounds(details.activityBounds)
  nPossibleVals = data.GetNPossibleVals()
  if details.activityBounds:
    nPossibleVals[-1] = len(details.activityBounds)+1


  if setDescNames:
    composite.SetInputOrder(data.GetVarNames())
    composite.SetDescriptorNames(details._descNames)
  else:
    composite.SetDescriptorNames(data.GetVarNames())
  composite.SetActivityQuantBounds(details.activityBounds)
  if details.nModels==1:
    details.internalHoldoutFrac=0.0
  if details.useTrees:
    from ML.DecTree import CrossValidate,PruneTree
    if details.qBounds != []:
      from ML.DecTree import BuildQuantTree
      builder = BuildQuantTree.QuantTreeBoot
    else:
      from ML.DecTree import ID3
      builder = ID3.ID3Boot
    driver = CrossValidate.CrossValidationDriver
    pruner = PruneTree.PruneTree

    composite.SetQuantBounds(details.qBounds)
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   pruner=pruner,
                   nTries=details.nModels,pruneIt=details.pruneIt,
                   lessGreedy=details.lessGreedy,needsQuantization=0,
                   treeBuilder=builder,nQuantBounds=details.qBounds,
                   startAt=details.startAt,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   silent=not _verbose)

  elif details.useSigTrees:
    from ML.DecTree import CrossValidate
    from ML.DecTree import BuildSigTree
    builder = BuildSigTree.SigTreeBuilder
    driver = CrossValidate.CrossValidationDriver
    nPossibleVals = data.GetNPossibleVals()
    if details.activityBounds:
      nPossibleVals[-1] = len(details.activityBounds)+1
    if hasattr(details,'sigTreeBiasList'):
      biasList = details.sigTreeBiasList
    else:
      biasList=None
    if hasattr(details,'useCMIM'):
      useCMIM=details.useCMIM
    else:
      useCMIM=0
    if hasattr(details,'allowCollections'):
      allowCollections = details.allowCollections
    else:
      allowCollections=False
    composite.Grow(trainExamples,attrs,nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver,
                   nTries=details.nModels,
                   needsQuantization=0,
                   treeBuilder=builder,
                   maxDepth=details.limitDepth,
                   progressCallback=progressCallback,
                   holdOutFrac=details.internalHoldoutFrac,
                   replacementSelection=details.replacementSelection,
                   recycleVars=details.recycleVars,
                   randomDescriptors=details.randomDescriptors,
                   biasList=biasList,
                   useCMIM=useCMIM,
                   allowCollection=allowCollections,
                   silent=not _verbose)

  elif details.useKNN:
    from ML.KNN import CrossValidate
    from ML.KNN import DistFunctions

    driver = CrossValidate.CrossValidationDriver
    dfunc = ''
    if (details.knnDistFunc == "Euclidean") :
      dfunc = DistFunctions.EuclideanDist
    elif (details.knnDistFunc == "Tanimoto"):
      dfunc = DistFunctions.TanimotoDist
    else:
      assert 0,"Bad KNN distance metric value"


    composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                   buildDriver=driver, nTries=details.nModels,
                   needsQuantization=0,
                   numNeigh=details.knnNeighs,
                   holdOutFrac=details.internalHoldoutFrac,
                   distFunc=dfunc)

  elif details.useNaiveBayes or details.useSigBayes:
    from ML.NaiveBayes import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    if not (hasattr(details,'useSigBayes') and details.useSigBayes):
      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     mEstimateVal=details.mEstimateVal,
                     silent=not _verbose)
    else:
      if hasattr(details,'useCMIM'):
        useCMIM=details.useCMIM
      else:
        useCMIM=0

      composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
                     buildDriver=driver, nTries=details.nModels,
                     needsQuantization=0, nQuantBounds=details.qBounds,
                     mEstimateVal=details.mEstimateVal,
                     useSigs=True,useCMIM=useCMIM,
                     holdOutFrac=details.internalHoldoutFrac,
                     replacementSelection=details.replacementSelection,
                     silent=not _verbose)



##   elif details.useSVM:
##     from ML.SVM import CrossValidate
##     driver = CrossValidate.CrossValidationDriver
##     composite.Grow(trainExamples, attrs, nPossibleVals=[0]+nPossibleVals,
##                    buildDriver=driver, nTries=details.nModels,
##                    needsQuantization=0,
##                    cost=details.svmCost,gamma=details.svmGamma,
##                    weights=details.svmWeights,degree=details.svmDegree,
##                    type=details.svmType,kernelType=details.svmKernel,
##                    coef0=details.svmCoeff,eps=details.svmEps,nu=details.svmNu,
##                    cache_size=details.svmCache,shrinking=details.svmShrink,
##                    dataType=details.svmDataType,
##                    holdOutFrac=details.internalHoldoutFrac,
##                    replacementSelection=details.replacementSelection,
##                    silent=not _verbose)

  else:
    from ML.Neural import CrossValidate
    driver = CrossValidate.CrossValidationDriver
    composite.Grow(trainExamples,attrs,[0]+nPossibleVals,nTries=details.nModels,
                   buildDriver=driver,needsQuantization=0)

  composite.AverageErrors()
  composite.SortModels()
  modelList,counts,avgErrs = composite.GetAllData()
  counts = numpy.array(counts)
  avgErrs = numpy.array(avgErrs)
  composite._varNames = data.GetVarNames()

  for i in xrange(len(modelList)):
    modelList[i].NameModel(composite._varNames)

  # do final statistics
  weightedErrs = counts*avgErrs
  averageErr = sum(weightedErrs)/sum(counts)
  devs = (avgErrs - averageErr)
  devs = devs * counts
  devs = numpy.sqrt(devs*devs)
  avgDev = sum(devs)/sum(counts)
  message('# Overall Average Error: %%% 5.2f, Average Deviation: %%% 6.2f'%(100.*averageErr,100.*avgDev))

  if details.bayesModel:
    composite.Train(trainExamples,verbose=0)

  # blow out the saved examples and then save the composite:
  composite.ClearModelExamples()
  if saveIt:
    composite.Pickle(details.outName)
  details.model = DbModule.binaryHolder(cPickle.dumps(composite))

  badExamples = []
  if not details.detailedRes and (not hasattr(details,'noScreen') or not details.noScreen):
    if details.splitRun:
      message('Testing all hold-out examples')
      wrong = testall(composite,testExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(testExamples))))
      _runDetails.holdout_error = float(len(wrong))/len(testExamples)
    else:
      message('Testing all examples')
      wrong = testall(composite,namedExamples,badExamples)
      message('%d examples (%% %5.2f) were misclassified'%(len(wrong),
                                                          100.*float(len(wrong))/float(len(namedExamples))))
      _runDetails.overall_error = float(len(wrong))/len(namedExamples)

  if details.detailedRes:
    message('\nEntire data set:')
    resTup = ScreenComposite.ShowVoteResults(range(data.GetNPts()),data,composite,
                                             nPossibleVals[-1],details.threshold)
    nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
    nPts = len(namedExamples)
    nClass = nGood+nBad
    _runDetails.overall_error = float(nBad) / nClass
    _runDetails.overall_correct_conf = avgGood
    _runDetails.overall_incorrect_conf = avgBad
    _runDetails.overall_result_matrix = repr(voteTab)
    nRej = nClass-nPts
    if nRej > 0:
      _runDetails.overall_fraction_dropped = float(nRej)/nPts

    if details.splitRun:
      message('\nHold-out data:')
      resTup = ScreenComposite.ShowVoteResults(range(len(testExamples)),testExamples,
                                               composite,
                                               nPossibleVals[-1],details.threshold)
      nGood,nBad,nSkip,avgGood,avgBad,avgSkip,voteTab = resTup
      nPts = len(testExamples)
      nClass = nGood+nBad
      _runDetails.holdout_error = float(nBad) / nClass
      _runDetails.holdout_correct_conf = avgGood
      _runDetails.holdout_incorrect_conf = avgBad
      _runDetails.holdout_result_matrix = repr(voteTab)
      nRej = nClass-nPts
      if nRej > 0:
        _runDetails.holdout_fraction_dropped = float(nRej)/nPts


  if details.persistTblName and details.dbName:
    message('Updating results table %s:%s'%(details.dbName,details.persistTblName))
    details.Store(db=details.dbName,table=details.persistTblName)

  if details.badName != '':
    badFile = open(details.badName,'w+')
    for i in xrange(len(badExamples)):
      ex = badExamples[i]
      vote = wrong[i]
      outStr = '%s\t%s\n'%(ex,vote)
      badFile.write(outStr)
    badFile.close()

  composite.ClearModelExamples()
  return composite

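The on-the-fly activity quantization inside `RunOnData()` assigns each activity the index of the first bound it falls below, or `len(bounds)` when it is not below any of them. The sketch below isolates that rule; `quantize_activity` is a hypothetical helper name, and the comparison with `bisect_right` assumes the bounds are sorted in increasing order, which the source does not state explicitly.

```python
from bisect import bisect_right


def quantize_activity(act, bounds):
    """Return the index of the first bound that act is strictly below."""
    bound = 0
    while bound < len(bounds) and not act < bounds[bound]:
        bound += 1
    return bound


bounds = [1.0, 5.0]
activities = [0.5, 1.0, 3.0, 5.0, 7.0]
by_loop = [quantize_activity(a, bounds) for a in activities]
# For sorted bounds the same bins come out of a binary search.
by_bisect = [bisect_right(bounds, a) for a in activities]
```

Two bounds yield three bins (0, 1 and 2), which matches the `len(details.activityBounds)+1` possible activity values set in `RunOnData()`. A value equal to a bound lands in the upper bin.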
def RunIt(details,progressCallback=None,saveIt=1,setDescNames=0):
  """ does the actual work of building a composite model

    **Arguments**

      - details:  a _CompositeRun.CompositeRun_ object containing details
        (options, parameters, etc.) about the run

      - progressCallback: (optional) a function which is called with a single
        argument (the number of models built so far) after each model is built.

      - saveIt: (optional) if this is nonzero, the resulting model will be pickled
        and dumped to the filename specified in _details.outName_

      - setDescNames: (optional) if nonzero, the composite's _SetInputOrder()_ method
        will be called using the results of the data set's _GetVarNames()_ method;
        it is assumed that the details object has a _descNames attribute which
        is passed to the composites _SetDescriptorNames()_ method.  Otherwise
        (the default), _SetDescriptorNames()_ gets the results of _GetVarNames()_.

    **Returns**

      the composite model constructed


  """
  details.rundate = time.asctime()

  fName = details.tableName.strip()
  if details.outName == '':
    details.outName = fName + '.pkl'
  if not details.dbName:
    if details.qBounds != []:
      data = DataUtils.TextFileToData(fName)
    else:
      data = DataUtils.BuildQuantDataSet(fName)
  elif details.useSigTrees or details.useSigBayes:
    details.tableName = fName
    data = details.GetDataSet(pickleCol=0,pickleClass=DataStructs.ExplicitBitVect)
  elif details.qBounds != [] or not details.useTrees:
    details.tableName = fName
    data = details.GetDataSet()
  else:
    data = DataUtils.DBToQuantData(details.dbName,fName,quantName=details.qTableName,
                                   user=details.dbUser,password=details.dbPassword)

  composite = RunOnData(details,data,progressCallback=progressCallback,
                        saveIt=saveIt,setDescNames=setDescNames)
  return composite


def ShowVersion(includeArgs=0):
  """ prints the version number

  """
  print 'This is BuildComposite.py version %s'%(__VERSION_STRING)
  if includeArgs:
    import sys
    print 'command line was:'
    print ' '.join(sys.argv)

def Usage():
  """ provides a list of arguments for when this is used from the command line

  """
  import sys
  print __doc__
  sys.exit(-1)

def SetDefaults(runDetails=None):
  """  initializes a details object with default values

      **Arguments**

        - runDetails:  (optional) a _CompositeRun.CompositeRun_ object.
          If this is not provided, the global _runDetails will be used.

      **Returns**

        the initialized _CompositeRun_ object.


  """
  if runDetails is None: runDetails = _runDetails
  return CompositeRun.SetDefaults(runDetails)

def ParseArgs(runDetails):
  """ parses command line arguments and updates _runDetails_

      **Arguments**

        - runDetails:  a _CompositeRun.CompositeRun_ object.

  """
  import getopt
  args,extra = getopt.getopt(sys.argv[1:],'P:o:n:p:b:sf:F:v:hlgd:rSTt:BQ:q:DVG:N:L:',
                             ['nRuns=','prune','profile',
                              'seed=','noScreen',

                              'modelFiltFrac=', 'modelFiltVal=',

                              'recycle','randomDescriptors=',

                              'doKnn','knnK=','knnTanimoto','knnEuclid',

                              'doSigTree','doCMIM=','allowCollections',

                              'doNaiveBayes', 'mEstimateVal=',
                              'doSigBayes',

##                               'doSVM','svmKernel=','svmType=','svmGamma=',
##                               'svmCost=','svmWeights=','svmDegree=',
##                               'svmCoeff=','svmEps=','svmNu=','svmCache=',
##                               'svmShrink','svmDataType=',

                              'replacementSelection',

                              ])
  runDetails.profileIt=0
  for arg,val in args:
    if arg == '-n':
      runDetails.nModels = int(val)
    elif arg == '-N':
      runDetails.note=val
    elif arg == '-o':
      runDetails.outName = val
    elif arg == '-Q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -Q, specify a list as a string'
      runDetails.activityBounds=qBounds
      runDetails.activityBoundsVals=val
    elif arg == '-p':
      runDetails.persistTblName=val
    elif arg == '-P':
      runDetails.pickleDataFileName= val
    elif arg == '-r':
      runDetails.randomActivities = 1
    elif arg == '-S':
      runDetails.shuffleActivities = 1
    elif arg == '-b':
      runDetails.badName = val
    elif arg == '-B':
      # RunOnData() reads details.bayesModel, so set that attribute here
      runDetails.bayesModel=1
    elif arg == '-s':
      runDetails.splitRun = 1
    elif arg == '-f':
      runDetails.splitFrac=float(val)
    elif arg == '-F':
      runDetails.filterFrac=float(val)
    elif arg == '-v':
      runDetails.filterVal=float(val)
    elif arg == '-l':
      runDetails.lockRandom = 1
    elif arg == '-g':
      runDetails.lessGreedy=1
    elif arg == '-G':
      runDetails.startAt = int(val)
    elif arg == '-d':
      runDetails.dbName=val
    elif arg == '-T':
      runDetails.useTrees = 0
    elif arg == '-t':
      runDetails.threshold=float(val)
    elif arg == '-D':
      runDetails.detailedRes = 1
    elif arg == '-L':
      runDetails.limitDepth = int(val)
    elif arg == '-q':
      qBounds = eval(val)
      assert type(qBounds) in [type([]),type(())],'bad argument type for -q, specify a list as a string'
      runDetails.qBoundCount=val
      runDetails.qBounds = qBounds
    elif arg == '-V':
      ShowVersion()
      sys.exit(0)
    elif arg == '--nRuns':
      runDetails.nRuns = int(val)
    elif arg == '--modelFiltFrac':
      runDetails.modelFilterFrac=float(val)
    elif arg == '--modelFiltVal':
      runDetails.modelFilterVal=float(val)
    elif arg == '--prune':
      runDetails.pruneIt=1
    elif arg == '--profile':
      runDetails.profileIt=1

    elif arg == '--recycle':
      runDetails.recycleVars=1
    elif arg == '--randomDescriptors':
      runDetails.randomDescriptors=int(val)

    elif arg == '--doKnn':
      runDetails.useKNN=1
      runDetails.useTrees=0
##       runDetails.useSVM=0
      runDetails.useNaiveBayes=0
    elif arg == '--knnK':
      runDetails.knnNeighs = int(val)
    elif arg == '--knnTanimoto':
      runDetails.knnDistFunc="Tanimoto"
    elif arg == '--knnEuclid':
      runDetails.knnDistFunc="Euclidean"

    elif arg == '--doSigTree':
##       runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useNaiveBayes=0
      runDetails.useSigTrees=1
    elif arg == '--doCMIM':
      runDetails.useCMIM=int(val)
    elif arg == '--allowCollections':
      runDetails.allowCollections=True

    elif arg == '--doNaiveBayes':
      runDetails.useNaiveBayes=1
##       runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
      runDetails.useSigBayes=0
    elif arg == '--doSigBayes':
      runDetails.useSigBayes=1
      runDetails.useNaiveBayes=0
##       runDetails.useSVM=0
      runDetails.useKNN=0
      runDetails.useTrees=0
    elif arg == '--mEstimateVal':
      runDetails.mEstimateVal=float(val)

##     elif arg == '--doSVM':
##       runDetails.useSVM=1
##       runDetails.useKNN=0
##       runDetails.useTrees=0
##       runDetails.useNaiveBayes=0
##     elif arg == '--svmKernel':
##       if val not in SVM.kernels.keys():
##         message('kernel %s not in list of available kernels:\n%s\n'%(val,SVM.kernels.keys()))
##         sys.exit(-1)
##       else:
##         runDetails.svmKernel=SVM.kernels[val]
##     elif arg == '--svmType':
##       if val not in SVM.machineTypes.keys():
##         message('type %s not in list of available machines:\n%s\n'%(val,SVM.machineTypes.keys()))
##         sys.exit(-1)
##       else:
##         runDetails.svmType=SVM.machineTypes[val]
##     elif arg == '--svmGamma':
##       runDetails.svmGamma = float(val)
##     elif arg == '--svmCost':
##       runDetails.svmCost = float(val)
##     elif arg == '--svmWeights':
##       # FIX: this is dangerous
##       runDetails.svmWeights = eval(val)
##     elif arg == '--svmDegree':
##       runDetails.svmDegree = int(val)
##     elif arg == '--svmCoeff':
##       runDetails.svmCoeff = float(val)
##     elif arg == '--svmEps':
##       runDetails.svmEps = float(val)
##     elif arg == '--svmNu':
##       runDetails.svmNu = float(val)
##     elif arg == '--svmCache':
##       runDetails.svmCache = int(val)
##     elif arg == '--svmShrink':
##       runDetails.svmShrink = 0
##     elif arg == '--svmDataType':
##       runDetails.svmDataType=val

    elif arg== '--seed':
      # FIX: dangerous
      runDetails.randomSeed = eval(val)

    elif arg== '--noScreen':
      runDetails.noScreen=1

    elif arg== '--replacementSelection':
      runDetails.replacementSelection = 1

    elif arg == '-h':
      Usage()

    else:
      Usage()
  runDetails.tableName=extra[0]
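`ParseArgs()` passes the values of `-Q`, `-q` and `--seed` through `eval()`, which its own `# FIX: dangerous` comment flags. One possible alternative, sketched here as an assumption rather than as what the module does, is `ast.literal_eval`, which accepts only Python literals. `parse_list_arg` is a hypothetical helper name.

```python
import ast


def parse_list_arg(val, flag):
    """Parse a command-line list string such as "[0,2,2,2,2,0]".

    ast.literal_eval only evaluates literals, so expressions such as
    function calls are rejected with a ValueError instead of being run.
    """
    parsed = ast.literal_eval(val)
    if not isinstance(parsed, (list, tuple)):
        raise ValueError('bad argument type for %s, specify a list as a string' % flag)
    return parsed


# The -q example from the usage text: 4 descriptors, 2 quant bounds apiece.
q_bounds = parse_list_arg('[0,2,2,2,2,0]', '-q')
```

The explicit type check plays the role of the `assert` statements in `ParseArgs()`, but it is not disabled when Python runs with `-O`.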

if __name__ == '__main__':
  if len(sys.argv) < 2:
    Usage()

  _runDetails.cmd = ' '.join(sys.argv)
  SetDefaults(_runDetails)
  ParseArgs(_runDetails)


  ShowVersion(includeArgs=1)

  if _runDetails.nRuns > 1:
    for i in range(_runDetails.nRuns):
      sys.stderr.write('---------------------------------\n\tDoing %d of %d\n---------------------------------\n'%(i+1,_runDetails.nRuns))
      RunIt(_runDetails)
  else:
    if _runDetails.profileIt:
      import hotshot,hotshot.stats
      prof=hotshot.Profile('prof.dat')
      prof.runcall(RunIt,_runDetails)
      stats = hotshot.stats.load('prof.dat')
      stats.strip_dirs()
      stats.sort_stats('time','calls')
      stats.print_stats(30)
    else:
      RunIt(_runDetails)