"%%%%%%%%%%%%%%%%%% Data-Description % %%%%%%%%%%%%%%%%%%

COIL 1999 Competition Data

Data Type
multivariate

Abstract
This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

Sources
Original Owner
[1]ERUDIT European Network for Fuzzy Logic and Uncertainty Modelling in Information Technology

Donor
Jens Strackeljan
Technical University Clausthal
Institute of Applied Mechanics
Graupenstr. 3, 38678 Clausthal-Zellerfeld, Germany
[2]tmjs@itm.tu-clausthal.de

Date Donated: September 9, 1999

Data Characteristics
This data comes from a water quality study where samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances, including nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen and chloride. In parallel, algae samples were collected to determine the algae population distributions.

Other Relevant Information
The competition involved the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The competition [3]instructions contain additional information on the prediction task.

Data Format
There are a total of 340 examples each containing 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each example are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.
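As a minimal sketch of how the XXXXX placeholder might be handled when reading the data, the Python snippet below converts one record into typed values. It assumes that fields are separated by whitespace, which the description does not state. The function name parse_record and the sample record are hypothetical.

```python
MISSING = 'XXXXX'  # placeholder for empty fields, per the data description

def parse_record(line, n_nominal=3):
    '''Split one whitespace-separated record into typed values.

    Assumption: the first n_nominal fields (season, river size,
    fluid velocity) are text and all remaining fields are numeric.
    '''
    values = []
    for i, field in enumerate(line.split()):
        if field == MISSING:
            values.append(None)          # keep missing entries explicit
        elif i < n_nominal:
            values.append(field)         # nominal value, kept as text
        else:
            values.append(float(field))  # concentration or frequency
    return values

# Hypothetical, shortened record; not taken from the data set
sample = 'winter small_ medium 8.0 XXXXX 60.8 0.0'
print(parse_record(sample))
```

Mapping the placeholder to None rather than to 0.0 keeps missing measurements distinct from the value 0.0, which the description reserves for a very low frequency.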
The training data are saved in the file: analysis.data (ASCII format).

Table 1: Structure of the file analysis.data

A          ...  K           a          ...  g
CC[1,1]    ...  CC[1,11]    AG[1,1]    ...  AG[1,7]
...
CC[200,1]  ...  CC[200,11]  AG[200,1]  ...  AG[200,7]

Explanation:
CC[i,j]: Chemical concentration or river characteristic
AG[i,j]: Algal frequency

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.

Past Usage
[4]The Third (1999) International COIL Competition Home Page
_________________________________________________________________

[5]The UCI KDD Archive
[6]Information and Computer Science
[7]University of California, Irvine
Irvine, CA 92697-3425
Last modified: October 13, 1999

References
1. http://www.erudit.de/
2. mailto:tmjs@itm.tu-clausthal.de
3. file://localhost/research/ml/datasets/uci/raw/data/ucikdd/coil/instructions.txt
4. http://www.erudit.de/erudit/activities/ic-99/index.htm
5. http://kdd.ics.uci.edu/
6. http://www.ics.uci.edu/
7. http://www.uci.edu/

%%%%%%%%%%%%%%%%%% Task-Description % %%%%%%%%%%%%%%%%%%

Third International Competition
Protecting rivers and streams by monitoring chemical concentrations and algae communities.

Intelligent Techniques for Monitoring Water Quality using chemical indicators and algae population

Recent years have been characterised by increasing concern about the impact man is having on the environment. The impact on the environment of toxic waste, from a wide variety of manufacturing processes, is well known. More recently, however, it has become clear that the more subtle effects of nutrient level and chemical balance changes arising from farming land run-off and sewage water treatment also have a serious, but indirect, effect on the states of rivers, lakes and even the sea.
In temperate climates across the world, summers are characterized by numerous reports of excessive algae growth resulting in poor water clarity, mass deaths of river fish from reduced oxygen levels and the closure of recreational water facilities on account of the toxic effects of this annual algal bloom. Reducing the impact of these man-made changes in river nutrient levels has stimulated much biological research with the aim of identifying the crucial chemical control variables for the biological processes.

The data used in this problem comes from one such study. During the research study, water quality samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances, including nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen and chloride. In parallel, algae samples were collected to determine the algae population distributions.

It is well known that the dynamics of the algae community are determined by the external chemical environment, with one or more factors being predominant. While the chemical analysis is cheap and easily automated, the biological part involves microscopic examination, requires trained manpower and is therefore both expensive and slow.

Diatoms like Cymbella are major contributors to primary production throughout the world. The diatom is highly sensitive to even small changes in acidity. Over a history of three and a half billion years, algae have evolved and adapted as primary plant colonizers of almost every known habitat in terrestrial and aquatic environments. They respond very rapidly to man-made environmental changes. The relationship between the chemical and biological features is complex and can be expected to need the application of advanced techniques.
Typical of such real-life problems, the particular data set for the problem contains a mixture of (fuzzy) qualitative variables and numerical measurement values, with much of the data being incomplete.

The competition task is the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The last two variables are given as linguistic variables.

340 data sets were taken and each contains 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each data set are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.

Each participant in the competition receives 200 complete data sets (training data) and 140 data sets (evaluation data) containing only the 11 values of the river descriptions and the chemical concentrations. This training data is to be used in obtaining a 'model' providing a prediction of the algal distributions associated with the evaluation data.

The training data are saved in the file: analysis.txt (ASCII format).

Structure of the file analysis.txt

A        ...  K         a        ...  g
CC1,1    ...  CC1,11    AG1,1    ...  AG1,7
...
CC200,1  ...  CC200,11  AG200,1  ...  AG200,7

Explanation:
CCi,j: Chemical concentration, j = 1, ..., 11
AGi,k: Algal frequency, k = 1, ..., 7

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.

Evaluation data are saved in the file eval.txt (ASCII format).

Table 2: Structure of the file eval.*

A        ...  K
CC1,1    ...  CC1,11
...
CC140,1  ...  CC140,11
_____________________________________________________________

Objective
The objective of the competition is to provide a prediction model on the basis of the training data. Having obtained this prediction model, each participant must provide the solution in the form of the results of applying this model to the evaluation data. The results obtained in this way should correspond to the results of the evaluation data (which are known to the organizer). The criterion used to evaluate the results is given below. All 7 algae frequency distributions must be determined. For this purpose any number of partial models may be developed.
_____________________________________________________________

Judgment of the results
To judge the results, the sum of squared errors will be calculated. The following table describes the results of a particular participant.

Matrix of results

a         ...  g
Res1,1    ...  Res1,7
...
Res140,1  ...  Res140,7

Any solution that achieves the smallest total error will be regarded as a winner of the contest.
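The judging rule above can be sketched in Python as a sum of squared errors over every entry of the results matrix (140 rows by 7 columns in the competition). This is an illustration only: the function name sum_squared_errors and the toy numbers are hypothetical, and the text does not say how the organizers treated missing entries, so the sketch assumes both matrices are complete.

```python
def sum_squared_errors(predicted, actual):
    '''Total squared error between two matrices given as lists of rows.'''
    if len(predicted) != len(actual):
        raise ValueError('row counts differ')
    total = 0.0
    for pred_row, true_row in zip(predicted, actual):
        if len(pred_row) != len(true_row):
            raise ValueError('column counts differ')
        # Add the squared difference of every matching pair of entries
        total += sum((p - t) ** 2 for p, t in zip(pred_row, true_row))
    return total

# Toy 2 x 3 example with made-up numbers (the real matrices are 140 x 7)
predicted = [[1.0, 0.0, 2.5], [0.0, 3.0, 1.0]]
actual = [[1.5, 0.0, 2.0], [0.0, 1.0, 1.0]]
print(sum_squared_errors(predicted, actual))
```

Because the errors are squared before summing, a single large miss on one algae frequency costs more than several small misses of the same combined size.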
Information about the dataset CLASSTYPE: numeric CLASSINDEX: last ALGAE #: 6/7" "0" "coil-test-6" 7.95 7.98 8 8.35 8.1 8.37 7.31 7.91 7.99 7.82 6.6 6.79 6.78 7.8 8.3 8.2 8.2 8.1 8.54 8.5 7.7 8.4 8.5 8.1 6.13 nan 7.2 7.9 7.8 7.8 7.8 5.9 6.8 6.6 6.6 6.9 7.66 7.6 8 8.2 8.2 8 8.9 8.2 7.8 7.5 7.4 7.5 9.1 8.9 8.5 8.3 7.8 7.6 7.5 7.5 7.7 7.9 8.06 8.5 8.2 8 8.7 8.4 8.5 8.3 8.4 7.9 9.13 7.4 8.3 8.2 8.2 8.17 8.33 8.5 8.1 8.2 8.1 7.8 8.26 8.11 7.87 7.2 7.8 7.9 8 7.7 8.7 7.9 8.7 8.6 7.4 7.8 8.5 7.98 7.95 7.96 8.35 8.15 8.5 8.5 8.5 8.4 8.2 8 8.3 8.7 8.6 8 8.2 8.5 8.04 7.95 7.25 7.64 7.92 7.62 7.75 7.08 6.92 8.1 7.2 8.61 8.22 8.53 8.4 8.1 8.12 8.43 8.7 8.1 8.35 5.7 8.8 7.2 8.4 13.2 12.1 9.9 11.2 10.7 11.5 10.8 9.4 10.2 10.8 12.7 11.3 10.4 6.4 12.83 7.8 6.8 10.5 11.5 12.2 11.23 12.1 10.4 8.6 9.1 9.4 11.35 11.9 9.1 8.8 11.8 9.2 10.8 10.5 7 7.8 10.7 8.5 10.5 9.2 8.8 10.8 9 8.9 8 8 10.74 8.6 6.3 9.2 9.2 8.6 4.8 7.2 2.2 7.5 10.4 4.8 10.8 11.2 8.3 8.8 10.8 11.9 12 11.4 8.9 10.4 11.2 6.3 10.6 6.7 9.1 11.9 9.4 7.9 5 6.6 1.8 10.1 8.3 11.3 8.8 9.3 5.4 5.3 12.2 6.5 7.3 10.4 9.8 5.6 7.2 5.5 2.75 10.4 10.1 11.4 8.5 11.43 9.9 10.98 8.9 6.8 10.4 9.1 9.5 9.6 9.3 9.1 9.54 10.3 8.5 9.4 10.7 8.4 11.1 9.8 11.3 10.1 9.5 10.5 10 10.9 10.2 10.8 11.7 8.2 11.1 57.333 59.333 80 68 19 12.85 6 5 4 8.18 4 11.42 10.704 14.568 27 6 3.577 21.2 22.545 71 65 50.6 57.292 66 8.87 18 18 27.65 36.124 5.714 5.343 nan nan nan nan nan 4 3.05 37.091 37.625 134.667 131.469 34.8 30.037 29.078 10.357 13.75 55.8 101.2 60.2 56.292 75 136.667 64.778 61.557 57.5 88.909 55.25 39 9.3 63.3 58.767 1.118 0.5 36.583 64.768 47.304 11.862 30.496 12.031 271.5 41 36 37.3 36.156 45.609 47.267 12.25 11 87 44.818 49.857 49.25 49.5 51.5 82.5 176.25 66 48 48 32.23 43 19 22.5 70.25 47.06 57.286 131.364 97.733 189.567 3 3 4.025 4.966 6.4 9.7 42.058 16.889 15.182 15.375 17.875 16.545 130.263 76.886 nan 34.235 10.867 11.055 15.5 9.45 9.1 14.34 8.97 3.518 2.3 3 3.51 9.056 7.613 35.642 21.4656 26.54 22.56 2.46 7.392 1.957 3.026 0 0.84 1.395 1.383 
1.368 1.488 1.18 1.966 1.46 1.228 4.04 1.56 0.788 3.222 4 11.02 1.833 10.494 10.526 4.08 0.62 3.14 2.42 2.063 5.974 0.807 1.363 1.88 0.78 0.95 2.21 2.21 0.997 1.002 2.237 1.453 4.504 3.454 6 5.184 2.823 3.35 5.268 4.408 4.306 4.033 0.694 5.18 3.734 6.164 7.035 7.368 1.714 2.235 2.085 1.557 0.389 0.308 0.534 0.32 5.632 6.272 7.773 2.209 4.971 1.621 6.315 5.16 4.4 0.527 1.137 4.411 9.367 2.348 2.251 12.13 0.526 0.993 0.611 3.955 2.098 6.283 0.618 3.56 1.139 0.513 1.887 0.668 4.39 4.72 1.644 3.088 3.746 3.313 3.681 5.011 0.851 0.774 0.825 0.969 0.553 0.874 5.922 2.139 2.502 2.118 2.363 3.849 3.776 3.461 0.642 2.942 1.715 1.51 3.976 1.572 0.63 0.73 0.23 0.663 0.672 0.758 0.866 0.825 0.699 6.225 3.765 2.805 3.14 273.333 286.667 174.286 458 130 15 58.75 6 117 39 80 42 46 61.25 10 10 10.583 44 170.5 500 782.5 334 312.6 10 36 10 80 62.5 169 22.143 19.75 5 10 20 10 10 15 13.333 146.364 105.714 617.778 792 122.556 174.8 263.556 127.667 58.75 389 273.75 306.471 264.8 560 154.444 720 558.333 577 669.091 89.375 773.125 260 217.143 93.75 26.364 10 440.833 357.167 258.909 128.636 99.6 176.8 375 410 32.5 82 119.444 160 169.091 121.875 48.75 652.5 97.273 194.28 357.125 55 30.2 300 440 310 144.286 138.333 233.5 95 120 178.75 285 357 425.714 810.9 137.444 162.944 37.778 10.909 23.636 24.111 21.429 67.7 116.727 30 140.909 43.75 63.75 103.273 131.008 93.827 85 41.43 199.54 13.56 57.64 26.54 21 22.5 134.5 12.22 9.87 10.35 29.65 41 33.56 134 91.45 42.75 76.2 295.667 33.333 47.857 45.2 6 5 6 24.333 17.25 16 2 3 3 34.5 363 2 1.667 54.8 68 121 77.25 209.1 261.4 26 3 21 11 7.75 13.091 6 5.818 1 1 1 1 2 1.5 1.667 84.091 66.714 49.444 63.1 41.111 86.6 27 22 56.25 127.4 152.875 136 43.4 30.5 35.556 21.778 24.5 67.3 38.182 17.5 90.75 9.6 24.333 33.375 14.818 21.6 149 219 145.091 48.091 64.6 36.3 169 38 108 62 92.889 88.364 75 14 17.375 93.25 105.455 77 128.25 18 24.6 12.333 16.25 37 36.714 61.333 17.5 10.5 74.857 116.5 68.714 311.4 291.143 311.455 91 135.778 10.778 3.727 5.583 6 12 26.6 150.583 
37.111 31.909 48.875 44 34.273 97.5 68.333 14.6 17 3.222 4 10.5 4 5 23 13 3.222 4 4.1 5.8 20 28.034 103.5 38 48.5 41 380 138 113.714 111.8 40 10.507 16 30 44.75 139.5 59 15 13.714 62 482 5 2.088 155 116.069 nan 340 276.667 299.4 70 14.741 41 44 30 71.057 18.714 8.846 2 14 7 4 13 7.333 10.833 172.778 143.4 164.778 286.6 144.111 130.8 95.12 34.321 64 206.2 290.313 242.941 124.942 170 175.333 242.5 257.333 254.444 205.182 141.5 163.25 18.1 114 110.875 20.9 27.6 266.364 302.5 223.044 69.079 146.265 58.599 313.5 61 155.5 133.1 112.855 180.364 127.778 27.5 66.875 209 181.636 197.571 185.125 138 184.4 53.333 79.25 nan 66.833 89.167 66.167 74.667 166.286 201 132 342.3 330 349.818 155.556 219.278 23.889 8.091 31.091 18.167 76.286 51.034 220.723 85.444 77.7 86.5 77 63.4 152.966 146.049 19.45 41.567 27.2 12.65 43.169 13.6 nan 45.5 19 7 6.123 nan 15 58 49.658 nan 83 88.125 98.665 nan 7.1 4.5 3.2 2 13.8 0.8 32 0.8 0.4 0.6 0.6 0.7 1.1 6 nan 0.8 61.52 41.6 7.1 9 20.72 23.5 1.8 2.1 4.8 2.5 nan 3.3 1.5 1.9 nan nan nan nan nan 1 nan 2.3 2.6 19.2 8.2 27.03 3.45 11.5 1.2 2.5 5 10.7 18.4 30.48 16.7 2.7 54.2 19.5 22 2.8 17 26 3.9 2.7 2.7 1.4 0.6 19.827 8.267 13.36 2.755 54.13 36.1 2.8 6 3 1.4 10.5 32.833 3.667 4.6 2.5 6 20.6 13 4.5 49 31.3 13.7 3.5 17.35 22.017 4 39.333 63.5 5.3 2.7 16.028 18.53 4.714 20.47 2.744 2.859 0.5 3.6 2.4 2.133 1.3 2.2 6.7 23.033 15.318 8.125 8.463 14.682 6.15 3.95 0.46 7.43 1.9 1.456 3.12 0.675 2.46 0.85 nan 1.3 0.8 4 2.86 nan 2.2 45.375 17 13.98 17.456 0 46.4 0 1 0 0 0 3.7 0 1.7 0 0 0 0 22.8 0 0 0 2.4 8.3 1 45.7 42.3 10.3 0 14.5 3.7 0 3 0 0 0 0 0 0 3.4 0 0 0 0 10.9 29.3 17.8 2.2 1 0 27.6 50.8 61.1 54.3 1.4 70.7 3.3 5.1 9.5 22.4 0 0 0 8.4 24.6 0 0 0 5 2.1 39.8 19.8 20.9 3.1 3.8 25.7 11.6 0 0 0 37.1 16.9 8.4 0 0 0 0 4.6 0 12.8 0 0 0 0 0 0 1.7 10.4 0 3.1 1.7 0 25.2 5.8 0 0 0 0 10.1 4.9 0 0 23.3 7.7 1.5 3.7 4.1 3.5 0 11.1 0 1.6 10.1 0 1.4 0 0 0 0 0 0 0 5.9 4.6 2.6 0 3.1 "medium" "medium" "medium" "high__" "medium" "medium" "high__" "high__" "high__" "high__" 
"high__" "high__" "high__" "high__" "high__" "high__" "high__" "medium" "medium" "medium" "medium" "high__" "high__" "high__" "low___" "medium" "medium" "medium" "medium" "high__" "high__" "high__" "high__" "medium" "high__" "medium" "high__" "high__" "medium" "medium" "medium" "medium" "high__" "high__" "medium" "high__" "high__" "medium" "medium" "high__" "low___" "high__" "high__" "medium" "medium" "low___" "low___" "low___" "high__" "high__" "medium" "medium" "high__" "high__" "low___" "low___" "medium" "high__" "medium" "high__" "medium" "medium" "medium" "low___" "low___" "medium" "medium" "high__" "high__" "low___" "high__" "high__" "high__" "high__" "high__" "medium" "medium" "low___" "low___" "low___" "low___" "low___" "high__" "high__" "low___" "low___" "low___" "medium" "medium" "medium" "high__" "medium" "medium" "medium" "high__" "high__" "low___" "low___" "medium" "medium" "medium" "medium" "medium" "medium" "high__" "high__" "high__" "high__" "high__" "high__" "high__" "high__" "high__" "medium" "medium" "medium" "medium" "high__" "high__" "low___" "low___" "low___" "low___" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "small_" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "medium" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" 
"large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "large_" "small_" "small_" "small_" "small_" "medium" "small_" "small_" "small_" "small_" "large_" "large_" "large_" "large_" "large_" "medium" "large_" "large_" "large_" "large_" "summer" "winter" "summer" "spring" "spring" "summer" "spring" "autumn" "summer" "autumn" "summer" "autumn" "summer" "summer" "spring" "autumn" "summer" "autumn" "summer" "autumn" "spring" "autumn" "summer" "summer" "autumn" "winter" "summer" "spring" "autumn" "summer" "autumn" "autumn" "winter" "summer" "winter" "winter" "winter" "autumn" "winter" "spring" "autumn" "summer" "autumn" "summer" "summer" "summer" "spring" "spring" "autumn" "autumn" "summer" "autumn" "spring" "autumn" "summer" "autumn" "spring" "spring" "spring" "autumn" "autumn" "winter" "autumn" "spring" "winter" "autumn" "autumn" "autumn" "autumn" "autumn" "summer" "winter" "summer" "spring" "autumn" "spring" "autumn" "winter" "summer" "summer" "spring" "summer" "summer" "winter" "spring" "winter" "summer" "winter" "winter" "summer" "autumn" "winter" "spring" "summer" "winter" "spring" "summer" "spring" "winter" "summer" "winter" "winter" "spring" "autumn" "spring" "autumn" "autumn" "spring" "winter" "summer" "summer" "spring" "spring" "autumn" "summer" "autumn" "winter" "spring" "summer" "winter" "summer" "winter" "spring" "spring" "summer" "winter" "summer" "winter" "summer" "winter" "winter" "summer" "autumn" "season" "river_size" "fluid_velocity" "concentration_1" "concentration_2" "concentration_3" "concentration_4" "concentration_5" "concentration_6" "concentration_7" "concentration_8" "algae_6" "season" "river_size" "fluid_velocity" "double0" "nominal:autumn,spring,summer,winter" "nominal:large_,medium,small_" "nominal:high__,low___,medium" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"