JIST2015 Data Challenge



  • Apply Filters for Attributes
    • Change all properties to nominal, using a regex pattern to match them
      • String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
      • Replace every {0} or {1} with {0,1} in the train.arff and test.arff files so that the two files are compatible.
    • Normalize all numeric attributes into the [0,1] interval
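
The two data-preparation steps above can be sketched outside Weka as follows. This is a minimal illustration, not the workflow actually used: the helper names `widen_binary_domains` and `min_max_normalize` are hypothetical, the header rewrite assumes the single-value domains appear only on `@attribute` lines, and mapping a constant column to zeros is an assumption rather than documented Weka behaviour.

```python
import re

# Pattern copied from the notes above; it recognises URL-like strings.
URL_PATTERN = re.compile(
    r"^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
)


def widen_binary_domains(arff_text):
    """Rewrite single-value domains {0} or {1} as {0,1} on @attribute lines only."""
    fixed = []
    for line in arff_text.splitlines():
        # Data rows are left untouched; only header declarations are edited.
        if line.lstrip().lower().startswith("@attribute"):
            line = re.sub(r"\{\s*[01]\s*\}", "{0,1}", line)
        fixed.append(line)
    return "\n".join(fixed)


def min_max_normalize(values):
    """Scale numbers into [0,1]; a constant column maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applying `widen_binary_domains` to both train.arff and test.arff gives every binary attribute the same declared domain in both files, which is the compatibility the notes ask for.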


  • Save Predictions to a File
    • [Visualize classifier errors] -> save as a .arff file
    • Remove the header lines and import the data into a .csv file
    • The final two columns show the "predicted classes" and "actual classes"
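
The conversion above can also be scripted. The sketch below assumes a dense (non-sparse) ARFF file with single-quoted strings and no backslash escapes. The function names are hypothetical, and the accuracy helper is an added convenience rather than part of the original steps.

```python
import csv
import io


def arff_data_rows(arff_text):
    """Return the rows that follow @data, skipping blank lines and % comments."""
    rows, in_data = [], False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue
        if in_data:
            rows.append(stripped)
        elif stripped.lower() == "@data":
            in_data = True
    # ARFF quotes strings with single quotes, so tell the CSV parser about it.
    return list(csv.reader(rows, quotechar="'"))


def predicted_and_actual(rows):
    """Take the final two columns: (predicted class, actual class) per the notes."""
    return [(row[-2], row[-1]) for row in rows]


def accuracy(pairs):
    """Fraction of rows whose predicted class equals the actual class."""
    return sum(p == a for p, a in pairs) / len(pairs)


def to_csv(rows):
    """Serialise the header-free rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Writing the result of `to_csv(arff_data_rows(text))` to a file gives the header-free .csv described above.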





  • Parameter Selection
    • Use CVParameterSelection as the classifier
    • Choose the base classifier whose parameters are to be optimized
    • Add each parameter that needs to be tuned to CVParameters
      • e.g., "I 10 250 25" for Random Forest tests 10, 20, ..., 250 (25 steps) for the [number of trees] parameter
      • The output shows the best options found: "Classifier Options: -I 90 -K 0 -S 1"
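
To make the "I 10 250 25" example concrete, the sketch below expands such a specification into the evenly spaced values it describes. The function name `expand_cv_parameter` is hypothetical, and the code only reproduces the arithmetic stated in the notes; it does not model how Weka rounds values or runs the cross-validation.

```python
def expand_cv_parameter(spec):
    """Expand 'NAME LOWER UPPER STEPS' into (name, evenly spaced candidate values)."""
    name, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps < 2:
        # A single step leaves only the lower bound to try.
        return name, [lower]
    # STEPS values span the range, so there are STEPS - 1 gaps between them.
    increment = (upper - lower) / (steps - 1)
    return name, [lower + i * increment for i in range(steps)]
```

For "I 10 250 25" the increment is (250 - 10) / 24 = 10, which gives the 25 candidates 10, 20, ..., 250 mentioned above.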

  • Feature Selection
    • Fast feature (attribute) selection using ranking (a search-based method is not feasible because the number of attributes is large)
      • GainRatioAttributeEval with Ranker
      • Discard attributes scoring below a threshold; experiments with thresholds of 0.1, 0.2, 0.3, and 0.4 showed that 0.2 works best
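
The ranking approach above can be illustrated with a small stand-alone sketch. It computes the gain ratio of nominal attributes (information gain divided by the attribute's own entropy) and keeps those scoring at or above a threshold. The function names are hypothetical, and keeping scores equal to the threshold is an assumption; Weka's own implementation may handle ties, missing values, and numeric attributes differently.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a nominal feature divided by its split information."""
    total = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (count / total) * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        # A feature with a single value carries no information.
        return 0.0
    return (entropy(labels) - conditional) / split_info


def rank_and_filter(columns, labels, threshold):
    """Rank features by gain ratio and keep those scoring at or above threshold."""
    scores = {name: gain_ratio(col, labels) for name, col in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]
```

Because each attribute is scored independently, the cost grows linearly with the number of attributes, which is why ranking stays practical where a subset search does not.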

  • Consider Index (Can it improve the performance?)
    • Formula used: 1 - index/size
    • Answer: