Pointers to: On-Line Software for Clustering and Multivariate Analysis This is a short review of programs and packages available for public access, by anonymous ftp, Gopher or World-Wide Web. It takes the form of cuts-and-pastes from newsgroup postings or email messages. No attempt has been made to list codes which can be had by directly contacting the author. No attempt has been made (so far) to list system-specific sites (e.g. SAS, XLisp-Stat). No attempt has been made (again, so far) to list commercial or shareware codes. No guarantees are given nor implied in respect to software referred to here. F. Murtagh (fd.murtagh@ulst.ac.uk), May 1994. Updates Sept. 1994, July 1995, October 1996, February 1997, March 1997. ----------------------------------------------------------------- Statlib: a major site for statistical software of all sorts. Gopher to lib.stat.cmu.edu Anonymous ftp to lib.stat.cmu.edu URL: http://lib.stat.cmu.edu/ Here are some areas to check out: CMLIB - Core Mathematics Library from NIST. CLUSTER "is a sublibrary of Fortran subroutines for cluster analysis and related line printer graphics. It includes routines for clustering variables and/or observations using algorithms such as direct joining and splitting, Fisher's exact optimization, single-link, K-means, and minimum mutations, and routines for estimating missing values. The subroutines in CLUSTER are described in the book "Clustering Algorithms" by J. A. Hartigan." APSTAT - Selected Algorithms Transcribed from Applied Statistics. Mostly Fortran. Includes implementations of: minimal spanning tree, single-link hierarchical clustering, discriminant analysis of categorical data, branch and bound algorithm for feature subset selection, etc. GENERAL - Software of General Statistical Interest. Includes the 3-d interactive data display package, XGobi. Algorithms for convex hull, and Delaunay triangulation. Mclust, model-based clustering routines (Banfield and Raftery). 
MVE, minimum volume ellipsoid estimator (Rousseeuw), PROGRESS, robust regression (Rousseeuw and Leroy), MARS, projection pursuit. Nonlinear discriminant analysis. LOESS regression. Etc. MULTI - Multivariate Analysis and Clustering. Hierarchical clustering, principal components analysis, discriminant analysis. Former are mainly Fortran. Macintosh programs for multivariate data analysis and graphical display, linear regression with errors in both variables, software directory including details of packages for phylogeny estimation and to support consensus clustering. ----------------------------------------------------------------- Netlib: a major site for numerical analysis software, including eigenvalue/vector packages EISPACK, SVDPACK, etc. Anonymous ftp to netlib.att.com ----------------------------------------------------------------- Tooldiag (Thomas W. Rauber): "an experimental package for the analysis and visualization of sensorial data. It permits the selection of features for a supervised learning task, the error estimation of a classifier based on continuous multidimensional feature vectors and the visualization of the data". It can be obtained via anonymous, binary ftp from ftp.fct.unl.pt - pub/di/packages as tooldiag1.5.tar.Z. ----------------------------------------------------------------- ALN (Adaptive Logic Network; William W. Armstrong, Dept. of Comp. Sci., University of Alberta. arms@cs.ualberta.ca): "belongs to the class of artificial neural systems. ... uses only simple logical functions AND, OR, and NOT. In hardware, computations would be done in parallel in a tree of combinational logic gates." Demonstration software in C-source form is available to researchers for non-commercial purposes only. (Contact author.) ----------------------------------------------------------------- Cluster (Andreas Stolcke, stolcke@ICSI.berkeley.edu): "cluster utility. ... 
performs Hierarchical Cluster Analysis (HCA) on a set of vectors and outputs the result in a variety of formats on standard output. ... performs Principal Component Analysis (PCA) on a set of vectors and prints the transformed set of vectors on standard output."

Available by anonymous ftp from icsi-ftp.berkeley.edu (128.32.201.55), cd pub/ai
Program is cluster-2.2.tar.Z

-----------------------------------------------------------------

Voronoi diagram/Delaunay triangulation:

Summary of responses to message in Vision-List Digest (20 April 1994) - see below for compiler, and subscription details to this Digest:

Algorithm by Steve Fortune is available from netlib@research.att.com
Use: "send sweep2 from voronoi"
The algorithm calculates both Voronoi and Delaunay diagrams.

Quickhull by anonymous ftp from geom.umn.edu
get /pub/software/qhull.tar.Z
The algorithm calculates the Delaunay triangulation and convex hull.

nnsort.c
Dave Watson sent me a copy of nnsort.c, which computes the Delaunay triangulation and convex hull in 2D and 3D.

deltree.c
Olivier Devillers sent a copy of deltree.c, which computes the Voronoi/Delaunay diagrams and also has a function that returns the nearest neighbour point in the diagram to any arbitrarily chosen point. He also includes an interactive interface in SunView. (Comments in French)

Books:
"Computational Geometry in C", by Joseph O'Rourke, Cambridge University Press, 1994, ISBN 0-521-44592-2. This has complete programs for Voronoi/Delaunay diagrams.

[Msg. from feisal@ldc.uwi.tt, in moderated Vision-List Digest; membership requests to vision-list-request@teleos.com]

3-d voronoi diagrams:
vcs (John M. Sullivan, Geometry Center, Univ. Minn.; sullivan@geom.umn.edu): "code for 3-d voronoi diagrams".
Available by anonymous ftp from: geom.umn.edu:pub/vcs.tar.Z ----------------------------------------------------------------- [Message dealing with CART-type methods:] Newsgroups: sci.stat.math,sci.stat.edu,sci.stat.consult From: saswss@hotellng.unx.sas.com (Warren Sarle) Date: Sun, 11 Sep 1994 18:35:20 GMT In "CART- Classification and Regression Trees", sci.stat.math article <34t1t0$m2i@search01.news.aol.com>, ajhorovitz@aol.com (AJHorovitz) writes: |> |> CART-Classification and Regression Trees (Algorithms produced by |> California Statistical Software (Breiman, et al, 1984) and Interface by |> SALFORD SYSTEMS) |> ... |> CART is a new tree structured statistical analysis program that can |> automatically search for and find the hidden structure in your data. Based |> on the original work of some of the world's leading statisticians, CART is |> the only "stand-alone" tree-based program that can give you statistically |> valid results. Since the task of distributing information on empirical decision tree methodology seems to have fallen on me, I feel I should correct the misinformation in the post quoted above. I asked AJHorovitz whether he intended to say that FIRM, Knowledge Seeker, and Data Splits (to name but a few '"stand-alone" tree-based programs') are statistically invalid. The gist of his reply was that only cross-validation yields statistically valid results. FIRM and Knowledge Seeker do multiplicity-adjusted significance tests. While some statisticians have philosophical objections to significance tests, branding significance tests as invalid in advertising literature strikes me as misleading as anything in the Systat/Statistica debate. Even if we grant that significance tests are statistically invalid, we are left with the fact that Data Splits and IND both do the same kind of cross-validation as CART does. So the claim that 'CART is the only "stand-alone" tree-based program that can give you statistically valid results' is clearly incorrect. 
I set follow-ups to sci.stat.edu, since that is where the recent debate on statistical software marketing has been going on.

Here is my summary of empirical decision tree software. I updated the information on CART to give Salford Systems' address.

..................................................................

There are many algorithms and programs for computing empirical decision trees. Several families can be identified with typical characteristics as listed below:

The CART family: CART, tree (S), etc.
   Motivation: statistical prediction.
   Exactly two branches from each nonterminal node.
   Cross-validation and pruning are used to determine size of tree.
   Response variable can be quantitative or nominal.
   Predictor variables can be nominal or ordinal, and continuous predictors are supported.

The CLS family: CLS, ID3, C4.5, etc.
   Motivation: concept learning.
   Number of branches equals number of categories of predictor.
   Only nominal response and predictor variables are supported in early versions, although I'm told that the latest version of C4.5 supports ordinal predictors.

The AID family: AID, THAID, CHAID, MAID, XAID, FIRM, TREEDISC, etc.
   Motivation: detecting complex statistical relationships.
   Number of branches varies from two to the number of categories of predictor.
   Statistical significance tests (with multiplicity adjustments in the later versions) are used to determine size of tree.
   AID, MAID, and XAID are for quantitative responses. THAID, CHAID, and TREEDISC are for nominal responses, although the version of CHAID from Statistical Innovations, distributed by SPSS, can handle a quantitative categorical response. FIRM comes in two varieties for categorical or continuous response.
   Predictors can be nominal or ordinal and there is usually provision for a missing-value category. Some versions can handle continuous predictors, others cannot.

There are also a variety of methods that do splits on linear combinations rather than single predictors.
I have not yet constructed a taxonomy for such methods.

Some programs combine two or more families. For example, IND combines methods from CART and C4 as well as Bayesian and minimum encoding methods. Knowledge Seeker combines methods from CHAID and ID3 with a novel multiplicity adjustment.

There are numerous unresolved statistical issues regarding these methods. Perhaps the most important is: how big should the tree be? CART supporters claim that its pruning method using cross-validation is superior to the significance testing method used in the AID family. However, pruning is very easy and quick to do in the AID family since the p-values are computed while growing the tree and no cross-validation is required for pruning. The validity of CART cross-validation is suspect because CART seems to produce much smaller trees than the AID family, even using very conservative significance levels for the latter, which one would expect to validate well although empirical evidence is scarce. I have not seen any published comparison of CART and AID methods. This would make an excellent topic for a thesis.

Some references:

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984), _Classification and Regression Trees_, Wadsworth: Belmont, CA.

Chambers, J.M. and Hastie, T.J. (1992), _Statistical Models in S_, Wadsworth & Brooks/Cole: Pacific Grove, CA.

Hawkins, D.M. & Kass, G.V. (1982), "Automatic Interaction Detection", in Hawkins, D.M., ed., _Topics in Applied Multivariate Analysis_, 267-302, Cambridge Univ Press: Cambridge.

Morgan & Messenger (1973) _THAID--a sequential analysis program for the analysis of nominal scale dependent variables_, Survey Research Center, U of Michigan.

Morgan & Sonquist (1963) "Problems in the analysis of survey data and a proposal", JASA, 58, 415-434. (Original AID)

Morton, S.C. (1992) "New advances in statistical dendrology", Chance, 5, 76-79. See also letter to editor in volume 6 no. 1.

Quinlan, J.R.
(1993), _C4.5: Programs for Machine Learning_, Morgan Kaufmann: San Mateo, CA.

The following information on software sources has been culled from previous posts and may be out of date or inaccurate:

C4.5
   C source code for a new, improved decision tree algorithm known as C4.5 is in the new book by Ross Quinlan (of ID3 fame): "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992. It goes for $44.95. With accompanying software on magnetic media it runs for $69.95. ISBN # 1-55860-238-0

CART
   Salford Systems, 341 N44th Street #711, Lincoln NE 68503, USA. Academic price is $399.00 (US). SYSTAT Corporation distributes a PC version of CART. They can be reached at SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201, USA. Phone: (708) 864-5670, FAX: (708) 492-3567.

CHAID
   PC version from SPSS (800) 543-5831. Mainframe version from Statistical Innovations Inc., 375 Concord Avenue, Belmont, Mass. 02178

Data Splits
   From Padraic Neville (510) 787-3452, $10 for preliminary release.

FIRM
   `FIRM Formal Inference-based Recursive Modeling', University of Minnesota School of Statistics Technical Report #546, 1992. The writeup and a diskette containing executables are available from the U of M bookstore for $17.50. Incredible bargain!

IND
   Version 2.0 should be available soon at a modest price from NASA's COSMIC center in Georgia, USA. Enquiries should be directed to:
      mail (to customer support): service@cossack.cosmic.uga.edu
      Phone: (706) 542-3265 and ask for customer support
      FAX: (706) 542-4807.

Knowledge Seeker
   Phone 613 723 8020.

PC-Group
   Available from Austin Data Management Associates, P.O. Box 4358, Austin, TX 78765, (512) 320-0935. It runs on IBM and compatible personal computers with 512K of memory, and costs $495. A free demo version of the program is available upon request.

tree
   S: phone 800 462-8146

TREEDISC
   SAS macro using SAS/IML and SAS/OR available free from SAS Institute technical support (919) 677-8000.

--
Warren S. Sarle, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA
saswss@unx.sas.com, (919) 677-8000
The opinions expressed here are mine and not necessarily those of SAS Institute.

------------------------------------------------------------------------------

From: Ronny Kohavi
Date: Tue, 24 Jan 1995

MLC++, a Machine Learning library in C++.

MLC++ is a library of C++ classes and tools for supervised Machine Learning being developed at the Robotics lab in Stanford University.

Ronny Kohavi (ronnyk@CS.Stanford.EDU, http://robotics.stanford.edu/~ronnyk)

------------------------------------------------------------------------------

From: jmerelo@kal-el.ugr.es (J.J. Merelo Guervos)
Date: 30 Dec 1994 11:27:34 GMT
Subject: Announcing S-LVQ 1.0.1

Dear fellow netters:

After getting some bug reports from users, I have fixed S-LVQ and produced a new version, which is basically a bug fix from 1.0. Here is the blurb:

S-LVQ is a quite simple program to perform Kohonen's LVQ algorithm. I know there is a very good program already made by Kohonen's team (LVQ_PAK), but, anyways, I had done it for my own purposes and thought it would be a good idea to release it into the public domain; it could be useful to somebody.

Some features:
- Command line interface to set the training file, test or validation file, number of neurons and number of epochs.
- Easy file setup
- Graphics interface written in TCL/TK, which allows setting the parameters and visualizes the results, as points if the training/weight vectors are 2-dimensional, and as lines if they are not.

Changes from version 1.0:
- Autoconfiguration
- Bug fixes for Sun SPARCstations.

If you want to know more about Kohonen's LVQ, this is the main reference:
Kohonen, T.; "The Self-Organizing Map", Procs. IEEE, vol. 78, pp. 1464-1480, 1990.

------------------------------

It's available from the usual sources, that is
1. FTP: get it at ftp://kal-el.ugr.es/pub/s-lvq-1.0.1.tar.gz
2.
ftpmail: use your favorite ftpmail server, or send a message to ftpmail@kal-el.ugr.es with the body open get s-lvq close You'll receive an uu-encoded version of the former program 3. WWW: connect to GeNeura's home page at http://kal-el.ugr.es/geneura.html, and follow instructions. -- Dr. JJ Merelo Grupo Geneura ---- Univ. Granada ------------------------------------------------------------------------------- Some time ago we released the software package "LVQ_PAK" for the easy application of Learning Vector Quantization algorithms. Corresponding public-domain programs for the Self-Organizing Map (SOM) algorithms are now available via anonymous FTP on the Internet. "What does the Self-Organizing Map mean?", you may ask --- See the following reference, then: Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, 1990. In short, Self-Organizing Map (SOM) defines a 'non-linear projection' of the probability density function of the high-dimensional input data onto the two-dimensional display. SOM places a number of reference vectors into an input data space to approximate to its data set in an ordered fashion. This package contains all the programs necessary for the application of Self-Organizing Map algorithms in an arbitrary complex data visualization task. This code is distributed without charge on an "as is" basis. There is no warranty of any kind by the authors or by Helsinki University of Technology. In the implementation of the SOM programs we have tried to use as simple code as possible. Therefore the programs are supposed to compile in various machines without any specific modifications made on the code. All programs have been written in ANSI C. The programs are available in two archive formats, one for the UNIX-environment, the other for MS-DOS. Both archives contain exactly the same files. These files can be accessed via FTP as follows: 1. Create an FTP connection from wherever you are to machine "cochlea.hut.fi". 
The internet address of this machine is 130.233.168.48, for those who need it.
2. Log in as user "anonymous" with your own e-mail address as password.
3. Change remote directory to "/pub/som_pak".
4. At this point FTP should be able to get a listing of files in this directory with DIR and fetch the ones you want with GET. (The exact FTP commands you use depend on your local FTP program.) Remember to use the binary transfer mode for compressed files.

The som_pak program package includes the following files:

- Documentation:
    README               short description of the package and installation instructions
    som_doc.ps           documentation in (c) PostScript format
    som_doc.ps.Z         same as above but compressed
    som_doc.txt          documentation in ASCII format

- Source file archives (which contain the documentation, too):
    som_p1r0.exe         Self-extracting MS-DOS archive file
    som_pak-1.0.tar      UNIX tape archive file
    som_pak-1.0.tar.Z    same as above but compressed

An example of FTP access is given below

    unix> ftp cochlea.hut.fi (or 130.233.168.48)
    Name: anonymous
    Password:
    ftp> cd /pub/som_pak
    ftp> binary
    ftp> get som_pak-1.0.tar.Z
    ftp> quit
    unix> uncompress som_pak-1.0.tar.Z
    unix> tar xvfo som_pak-1.0.tar

See file README for further installation instructions.

All comments concerning this package should be addressed to som@cochlea.hut.fi.

-------------------------------------------------------------------------------

Date: Mon, 20 Feb 1995 08:01:37 +0000
From: "Warren L. Kovach"
Subject: WWW: Statistical and data analysis software

I am pleased to announce my new World Wide Web pages focusing on shareware and public domain statistical and data analysis software. The URL is:

http://www.compulink.co.uk/kovcomp

These pages provide detailed information about and shareware copies of my programs MVSP and Oriana. MVSP is a multivariate statistical program for MS-DOS that calculates a variety of cluster analyses as well as PCA, PCO, and correspondence/detrended correspondence analysis.
Oriana is my new circular statistics/orientation analysis package for Windows. The pages also have a list of resources on the Internet related to statistical software. In particular, there are many links to WWW pages and FTP sites that have software. I hope to maintain a definitive list of sources of shareware and public domain software on the Internet. If you know of sites that are not yet on my list I would appreciate hearing about them. For a bit of fun, there is also a page with information about the Isle of Anglesey, in North Wales, the home of Kovach Computing Services, and links to other WWW pages about Wales. Come and learn how to pronounce one of the longest placenames in the world! -- Dr. Warren L. Kovach Internet: WarrenK@kovcomp.demon.co.uk Kovach Computing Services tel./fax: +44-(0)1248-450414 85 Nant-y-Felin, Pentraeth, Anglesey CompuServe: 100016,2265 Wales LL75 8UY U.K. WWW: http://www.compulink.co.uk/kovcomp ----------------------------------------------------------------------------- Message to CLASS-L list on 5 July 1995: Re fuzzy clustering, how about probabilistic clustering?: i.e. we give a number of classes and then each data "thing" is probabilistically assigned to the various classes. Wallace founded the information-theoretic Minimum Message Length (MML) principle in 1968 (see also subsequent closely related work of Rissanen called 'MDL') with a clustering program called Snob. Snob is freely licensed for academic research, see Wallace and Dowe(1994) for details and many references, and see ~ftp/pub/snob/ on bruce.cs.monash.edu.au for Fortran source code. Some references to Snob (due to me, I believe) and other clustering algorithms (collated by Ray Liere) is given below. Doug Fisher's Cobweb algorithm is not mentioned by Ray Liere, presumably because Ray thought everyone on that mailing list knew it. I mention Cobweb now, and apologise to anyone whose favoUrite algorithm has not been mentioned - and invite them to tell me or CLASS-L of it. 
Please feel free to e-mail me (David Dowe, dld@cs.monash.edu.au) for further info on Snob or on MML. Please flame no-one :-) . Regards (and further info follows). - David Dowe. > >From owner-inductive@hermes.csd.unb.ca Tue May 30 09:54:08 1995 >Date: Mon, 29 May 1995 20:48:55 -0300 >From: Ray Liere >Subject: Summary: Unsupervised Conceptual Clustering >To: Multiple recipients of list INDUCTIVE > >A few days ago (24 May), I posted a request for ideas on unsupervised >conceptual clustering, especially methods that are not based on the >assumption that each data object is categorized into exactly one >of the clusters. > >As you have seen, some responses were posted directly to this list. >I have also received several email replies. > >My thanks to everyone for the very constructive assistance. I received >many good leads to explore. > >And ... following is the promised summary of email responses that I received: >===== >>From: Chunyu Kit >> i am doing machine learning of NL grammar rules. i need an >> appropriate clustering approach to classify the higher categories >> found into some clusters that are expected to have some >> kind of correspondence to those ones in linguistic theories, >> like NP, PP, etc. >===== >>From: Daniel Fu >> there's a system OLOC (Overlapping concepts) that was described >> in the Machine Learning Journal maybe a year ago. It's shares a lot >> with COBWEB. >===== >>From: blw@utrc.utc.com (Brad Whitehall) >> Look at the CLUSTER and CLUSTER/s systems of Stepp and Michalski. >> They actually went to great pains to make it so clusters did NOT overlap. >> Michalski is now at George Mason University and might even be able >> to supply you with some code. >> >> I would also look at Fuzzy Clustering. I think you might find it much >> more useful for the types of problems described in your note. 
>=====
>>From: dld@bruce.cs.monash.edu.au (David L Dowe)
>> Chris Wallace developed Minimum Message Length (MML) in 1968, developing
>> the Snob program for unsupervised conceptual clustering and also applying
>> it to a real world problem of seal skulls in the same, 1968 paper.
>>
>> The most recent Snob reference is
>> C S Wallace and D L Dowe, "Intrinsic classification by MML - the Snob
>> program", Proc. 7th Australian Joint Conference on Artificial Intelligence
>> (UNE, Armidale, NSW, Australia, November 1994), World Scientific, pp 37-44.
>>
>> and you might wish to look at ~ftp/pub/snob/ on bruce.cs.monash.edu.au .
>>
>> See also:
>> C.S. Wallace, `Classification by Minimum-Message-Length
>> Inference', S.G. Akl et al (eds.) Advances in Computing and
>> Information - ICCI'90, Niagara Falls, Lecture Notes in Computer
>> Science, No.468, Springer-Verlag, pp 72-81, 1990.
>>
>> Wallace, C.S., `An Improved Program for Classification', ACSC-9, vol 8, no
>> 1, pp 357-366, February 1986.
>>
>> Wallace, C.S. & Boulton, D.M., `An Information Measure for
>> Classification', Computer Journal, Vol.11, No.2, 1968,
>> pp 185-194.
>>
>> MML is described in the 1968 paper and in
>> Wallace, C.S. & Freeman, P.R., `Estimation and Inference by Compact Coding',
>> The Journal of the Royal Statistical Society, Series B, Methodology, 49, 3,
>> 1987, pp 223-265.
>>
>> with some outline in Wallace and Dowe(1994) and introductory material in
>> C S Wallace and D L Dowe, "MML estimation of the von Mises concentration
>> parameter", Technical Report #93/193, Department of Computer Science, Monash
>> University, Clayton 3168, Australia.
>>
>> Autoclass is similar to the 1990 Snob (see Wallace, 1990, pp 78-80).
>> The only changes to Snob since (Wallace and Dowe, 1994) have been to permit
>> Poisson and (von Mises) circular variables.
>>
>> Peter Cheeseman is a former student of Prof. Wallace.
>>
>> Snob permits over-lapping mixtures.
In fact (Wallace and Dowe, 1994; >> and earlier Wallace Snob work) it can lead to statistically biassed answers >> if you don't. >===== >>From: RORWIG@BPA.ARIZONA.EDU (Richard E. Orwig) >> We've done conceptual clustering using a Hopfield net and Kohonen net on >> textual data. The Hopfield technique was reported in Chen, Hsu, Orwig, >> Hoopes, and Nunamaker in last year's October _Communications of the ACM_. >> >> My dissertation (completed this past month) reports the use of a Kohonen >> self-organizing map for textual clustering. It should hit the microfilm >> service in a couple of weeks. >> >> A major difference between the two is exactly your point -- the Hopfield >> neural net creates conceptual cluster headings and uses the keywords to >> organize the text documents. Documents containing keywords in two or more >> cluster headings will map to two or more respective clusters. The Kohonen >> algorithm, on the other hand, maps the document to its "best" region on a >> two-dimensional concept map. I've had the map define a conceptual region >> on the map with no data in it because the documents which all contained the >> concept fit better in other regions. >===== >>From: rbanerji@sjuphil.sju.edu (Ranan Banerji) >> All my life I had a problem with clustering. Any clustering method is >> based on some idea of similarity, proximity etc., be they numerical, >> symbolic or whatever. This similarity is determined by what the researcher >> considers similar. Very often in an application area we need to think of >> two objects as similar when they demand similar action, or some other >> problem dependent criterion of similarity. Whenever I have looked, it >> has seemed to me that the similarity imposed by the problem and the >> similarity imposed by the intuition is not the same. So the problem lies >> in getting a match between the two measures. The problem of computational >> complexity (which seems to be the thing bothering you) comes way after that. 
>> Refining the clustering method (to somehow get around the mismatch) is
>> what gives rise to the complexity. I have spent my life trying to
>> develop and improve methods for getting the correct match, i.e. to
>> solving the so-called "representation problem". My own advice would
>> be, concentrate on sharpening your intuition of the problem so you can
>> prove to yourself that your measure matches the measure imposed by the
>> problem. Once you have done that, any fast-and-easy technique of
>> clustering will work.
>=====
>>From: beatriz
>> I do not agree Autoclass allows an object in only one class
>> because it assigns probabilities to any object.
>> One of the advantages of Autoclass is that it works in domains
>> with noise and overlapping classes. See: "Bayesian classification"
>> P. Cheeseman et al, 1988.
>=====
>
>Ray Liere
>lierer@mail.cs.orst.edu

----------------------------------------------------------------------------

More on SNOB, Feb. 1997, from:
(Dr.) David Dowe, Dept of Computer Science, Monash University, Clayton, Victoria 3168, Australia
dld@cs.monash.edu.au
Fax: +61 3 9905-5146
http://www.cs.monash.edu.au/~dld/
ftp://ftp.cs.monash.edu.au/software/snob/
http://www.cs.monash.edu.au/~dld/mixture.modelling.page.html

------

Snob: Software developed by Chris Wallace and David Dowe for mixture modelling and clustering using the information-theoretic Minimum Message Length (MML) principle. Snob deals with data from Gaussian, multi-nomial (Bernoulli), Poisson and von Mises circular distributions, and deals with missing data. Snob has software for non-commercial use, detailed documentation, and a ReadMe file; the latest paper is available in PostScript.

----------------------------------------------------------------------------

Autoclass:
http://ic-www.arc.nasa.gov/ic/projects/bayes-group/group/html/autoclass-c-program.html
Version 2.0, available 8 June 1995 (C code). Information on SNOB is also available at above site.
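
[Editorial illustration] The messages above contrast hard clustering with probabilistic clustering, where each data "thing" receives a membership probability for every class rather than a single label. The following is only a minimal sketch of that idea, assuming a one-dimensional Gaussian mixture whose weights, means and standard deviations are already known. It is not the Snob or Autoclass code, and the function names log_gauss and memberships are hypothetical.

```python
import math


def log_gauss(x, mean, sd):
    """Log density of a univariate Gaussian at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def memberships(x, weights, means, sds):
    """Return the probability that x belongs to each mixture component.

    Assumes the mixture parameters are given; estimating them (as Snob
    and Autoclass do) is a separate problem not attempted here.
    """
    # Log of (component weight * component density) for every class.
    logs = [math.log(w) + log_gauss(x, m, s)
            for w, m, s in zip(weights, means, sds)]
    # Subtract the largest term before exponentiating to avoid underflow.
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# A point midway between two equally weighted, equally spread classes.
print(memberships(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

A point lying between two overlapping classes gets a share of each, which is the behaviour the correspondents describe as permitting overlapping classes; a hard assignment would simply take the largest membership.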
-----------------------------------------------------------------------------

Multidimensional scaling (from message from F. Murtagh, June 1995):

On Statlib (http://lib.stat.cmu.edu/), Fortran or C code, go to S and then to multiv, where a Sammon map program in Fortran is available. Under ripley there should be a better implementation, but maybe more integrated into S (to be checked again...).

For Netlib, go to http://www.netlib.org/, then 'The Netlib Repository', then mds for all the 1960s Bell Labs material.

------------------------------------------------------------------------------

Availability of the ADDTREE/P and EXTREE programs (message from James E. Corter, jec34@COLUMBIA.EDU, to the CLASS-L list on 28 July 1995):

Programs for fitting additive trees and extended trees to proximity data are now available commercially, and over the INTERNET in the form of PASCAL source code and DOS-executable code. The ADDTREE/P program for fitting additive trees incorporates a variant (Corter, 1982) of the basic Sattath & Tversky algorithm (Sattath & Tversky, 1977). The EXTREE program (Corter & Tversky, 1986) fits the extended tree model.

A procedure based on the Sattath-Tversky-Corter algorithm for fitting additive trees is available in the latest release (version 6.0) of SYSTAT for DOS, available from SPSS Inc., 444 N. Michigan Avenue, Chicago, IL 60611 (312) 329-3500.

Also, a standalone version (DOS-executable) of the ADDTREE/P program (Corter, 1982) written in the PASCAL language is available free of charge from the author. No support is available with this version, and there is an upper limit of 80 on the number of objects that can be modeled. The EXTREE program for fitting extended trees is also available (maximum n = 32).

Those with access to a file transfer program such as FTP on the INTERNET can retrieve the DOS-executable versions as follows. First, FTP to ftp.ilt.columbia.edu and login as "anonymous", then connect ("cd") to the directory "users/corter".
The program and documentation files can then be retrieved with the usual GET command (be sure to set the file transfer type to "BINARY" before GETing the executable files). Gopher users can get the files by gophering to gopher.ilt.columbia.edu and connecting to "users/corter".

Finally, PASCAL source code for the ADDTREE/P and EXTREE programs is maintained at an INTERNET site: the "netlib/mds" library at AT&T Bell Labs. This resource may be accessed via email, by sending a message to the INTERNET address netlib@research.att.com containing only the single line

    send readme index from mds

REFERENCES

Corter, J.E. (1982). ADDTREE/P: A PASCAL program for fitting additive trees based on Sattath & Tversky's ADDTREE program. Behavior Research Methods and Instrumentation, 14, 353-354.

Corter, J.E., & Tversky, A. (1986). Extended similarity trees. Psychometrika, 51, 429-451.

Sattath, S., & Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.

=====================================
James E. Corter
Dept. of Measurement, Evaluation, and Applied Statistics
Teachers College, Columbia University
New York, NY 10027
INTERNET: jec34@columbia.edu
=====================================

------------------------------------------------------------------------------

CLASS-L - 7 Aug 1995 to 23 Aug 1995

Date: Wed, 23 Aug 1995 18:58:09 +0200
From: Jean-Luc Voz
Subject: ELENA Classification databases and technical reports available

Dear colleagues,

The partners of the Elena project are pleased to announce the availability of several databases related to classification, together with two technical reports.

ELENA is an ESPRIT III Basic Research Action project (No. 6891). From July 92 to June 95 the ELENA project investigated several aspects of classification by neural networks, including links between neural networks and Bayesian statistical classification, incremental learning,...
The project includes theoretical work on classification algorithms, simulations and benchmarks, especially on realistic industrial data. Hardware implementation, especially the VLSI option, is the last objective. The set of databases available is to be used for tests and benchmarks of machine-learning classification algorithms. The databases are split into two parts: ARTIFICIALly generated databases, mainly used for preliminary tests, and REAL ones, used for objective benchmarks and comparisons of methods. The choice of the databases has been guided by various parameters, such as availability of published results concerning conventional classification algorithms, size of the database, number of attributes, number of classes, overlapping between classes and non-linearities of the borders,... Results of PCA and DFA preprocessing of the REAL databases are also included, together with several measures useful for characterizing the databases (statistics, fractal dimension, dispersion,...). All these databases and their preprocessing are available together with a PostScript technical report describing the different databases in detail ('Databases.ps.Z' - 45 pages - 777781 bytes) and a report related to the comparative benchmarking studies of various algorithms ('Benchmarks.ps.Z' - 113 pages - 1927571 bytes) well known to the Statistical and Neural Network communities (MLP, RCE, LVQ, k_NN, GQC) or developed in the framework of the Elena project (IRVQ, PLS). A LaTeX bibfile containing more than 90 entries corresponding to the Elena partners bibliography related to the project is also available ('Elena.bib') in the same directory.
All files are available by anonymous ftp from the following directory: ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases The databases are split into two parts: the 'ARTIFICIAL' ones, being generated in order to obtain some defined characteristics, and for which the theoretical Bayes error can be computed, and the 'REAL' ones, collected in existing real-world applications. The ARTIFICIAL databases ('Gaussian', 'Clouds' and 'Concentric') were generated according to the following requirements: - heavy intersection of the class distributions, - high degree of nonlinearity of the class boundaries, - various dimensions of the vectors, - already published results on these databases. They are restricted to two-class problems, since we believe these yield answers to the most essential questions. The ARTIFICIAL databases are mainly used for rapid test purposes on newly developed algorithms. The REAL databases ('Satimage', 'Texture', 'Iris' and 'Phoneme') were selected according to the following requirements: - classical databases in the field of classification (Iris), - already published results on these databases (Phoneme, from the ROARS ESPRIT project and 'Satimage' from the STATLOG ESPRIT project), - various dimensions of the vectors, - sufficient number of vectors (to avoid the ``empty space phenomenon''), - the 'Texture' database, generated at INPG for the Elena project, is interesting for its high number of classes (11). ############################################################################## ########### # DETAILS # ########### The 'Benchmarks' technical report ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 'Benchmarks.ps' Elena report is related to the benchmarking studies of various classifiers. Most of the classifiers which were used for the benchmark comparative studies are well known by the neural network and machine learning community.
These are the k-Nearest Neighbour (k_NN) classifier, selected for its powerful probability density estimation properties; the Gaussian Quadratic Classifier (GQC), the most classical statistical parametric simple classification method; the Learning Vector Quantizer (LVQ), a powerful non-linear iterative learning algorithm proposed by Kohonen; the Reduced Coulomb Energy (RCE) algorithm, an incremental Region Of Influence algorithm; the Inertia Rated Vector Quantizer (IRVQ) and the Piecewise Linear Separation (PLS) classifiers, developed in the framework of the Elena project. The main objectives of the 'Benchmarks.ps' Elena report are the following: - to provide an overall comprehensive view of the general problem of comparative benchmarking studies and to propose a useful common test basis for existing and further classification methods, - to obtain objective comparisons of the different chosen classifiers on the set of databases described in this report (each classifier being used with its optimal configuration for each particular database), - to study the possible links between the data structures of the databases viewed by some parameters, and the behavior of the studied classifiers (mainly the evolution of their optimal configuration parameters), - to study the links between the preprocessing methods and the classification algorithms from the performances and hardware constraints point of view (especially the computation times and memory requirements). Databases format ~~~~~~~~~~~~~~~ All the databases available are in the following format (after decompression): - All files containing the databases are stored as ASCII files for easy editing and checking. - In a file, each of the n lines holds one vector sample (instance), and each line consists of d floating-point numbers (the attributes) followed by the class label (which must be an integer).
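As a rough illustration of the format just described, the following Python sketch reads such lines into attribute vectors and integer class labels. The function name read_elena is hypothetical, and splitting fields on whitespace is an assumption (the announcement does not name the separator); the two sample lines are copied from the example that follows.

```python
from typing import Iterable, List, Tuple


def read_elena(lines: Iterable[str]) -> Tuple[List[List[float]], List[int]]:
    """Parse lines made of d floating-point attributes followed by an integer label."""
    vectors: List[List[float]] = []
    labels: List[int] = []
    for line in lines:
        fields = line.split()          # assumption: whitespace-separated fields
        if not fields:                 # skip blank lines
            continue
        *attributes, label = fields    # last field is the class label
        vectors.append([float(a) for a in attributes])
        labels.append(int(label))
    return vectors, labels


sample = [
    "1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1",
    "1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0",
]
vectors, labels = read_elena(sample)
print(len(vectors[0]), labels)   # prints: 9 [1, 0]
```

Because an open text file is itself an iterable of lines, the same function could be applied to an uncompressed database file, e.g. read_elena(open("phoneme.dat")), assuming the file follows the format exactly.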
Example:
1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0
1.51775 12.85 3.48 1.23 72.97 0.61 8.56 0.09 0.22 1
1.51753 12.57 3.47 1.38 73.39 0.60 8.55 0.00 0.06 1
1.51783 12.69 3.54 1.34 72.95 0.57 8.75 0.00 0.00 3
1.51567 13.29 3.45 1.21 72.74 0.56 8.57 0.00 0.00 1
There are NO missing values. If you desire to get a database, you MUST do it in the ftp binary mode. So if you aren't in this mode, simply type 'binary' at the ftp prompt. EXAMPLE: to get the "phoneme" database:
cd REAL
cd phoneme
binary
get phoneme.txt
get phoneme.dat.Z
get ...
cd ...
...
quit
After your ftp session, you simply have to type 'uncompress phoneme.dat.Z' to get the uncompressed datafile. Contents of the 'ARTIFICIAL' directory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The databases of this directory contain only the 'ARTIFICIAL' classification problems. The present 'ARTIFICIAL' databases are only two-class problems, since these yield answers to the most essential questions. For each problem, the confusion matrix corresponding to the theoretical Bayes boundary is provided with the confusion matrix obtained by a k_NN classifier (k chosen to reach the minimum of the total Leave-One-Out error). These databases were selected for preliminary tests and to study the behavior of the implemented algorithms for some particular problems: - Overlapping classes: The classifier should have the ability to form a decision boundary that minimizes the amount of misclassification for all of the overlapping classes. - Nonlinear separability: The classifier should be able to build decision regions that separate classes of any shape and size. There is one subdirectory for each database. In this subdirectory, there is: - A text file providing detailed information about the related database ('databasename.txt'). - The compressed database ('databasename.dat.Z'). The different patterns of each database are presented in a random order.
- For bidimensional databases, a postscript file representing the 2-D datasets (those files are in eps format). For each subdirectory, the directoryname is the same as the name chosen for the concerned database. Here are the directorynames with a brief description. - 'clouds' Bidimensional distributions: class 0 is the sum of three different normal distributions while class 1 is another normal, overlapping class 0. 5000 patterns, 2500 in each class. This allows the study of the classifier behavior for heavy intersection of the class distributions and for high degree of nonlinearity of the class boundaries. - 'gaussian' A set of seven databases corresponding to the same problem, but with dimensionality ranging from 2 to 8. This allows the study of the classifier behavior for different dimensionalities of the input vectors, for heavily overlapped distributions and for nonlinear separability. These databases were already studied by Kohonen in: Kohonen, T. and Barna, G. and Chrisley, R., "Statistical Pattern Recognition with Neural Networks: Benchmarking Studies", IEEE Int. Conf. on Neural Networks, SOS Printing, San Diego, 1988. In this paper, the performance of three basic types of neural-like networks (Backpropagation network, Boltzmann machine and Learning Vector Quantization) is evaluated and compared to the theoretical limit. - 'concentric' Bidimensional uniform concentric circular distributions. 2500 instances, 1579 in class 1, 921 in class 0. This database may be used to study the linear separability of the classifier when some classes are nested in others without overlapping. Contents of the 'REAL' directory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The databases of this directory contain only the real classification problem sets selected for the Elena benchmarking studies. There is one subdirectory for each database.
In this subdirectory, there are: - a text file giving detailed information about the related database (`databasename.txt'), - the compressed original database in the Elena format (`databasename.dat.Z'); the different patterns of each database being presented in a random order. - By way of a normalization process, each original feature will have the same importance in a subsequent classification process. A typical method is first to center each feature separately and then to reduce it to unit variance; this process has been applied on all the REAL Elena databases in order to build the ``CR'' databases contained in the ``databasename_CR.dat.Z'' files. The Principal Components Analysis (PCA) is a very classical method in pattern recognition [Duda73]. PCA reduces the sample dimension in a linear way for the best representation in lower dimensions keeping the maximum of inertia. The best axis for the representation is however not necessarily the best axis for the discrimination. After PCA, features are selected according to the percentage of initial inertia which is covered by the different axes and the number of features is determined according to the percentage of initial inertia to keep for the classification process. This selection method has been applied on every REAL database after centering and reduction (thus on the databasename_CR.dat files). When quasi-linear correlations exist between some initial features, these redundant dimensions are removed by PCA and this preprocessing is then recommended. In this case, before a PCA, the determinant of the data covariance matrix is near zero; this database is thus badly conditioned for all processes which use this information (the quadratic classifier for example).
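The centering/reduction step and the inertia-based choice of the number of principal components can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the variance is computed with divisor n (the announcement does not say whether n or n - 1 was used), and the eigenvalues in the usage line are made up rather than taken from any Elena file.

```python
import math
from typing import List, Sequence


def center_reduce(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Center each feature (column) and scale it to unit variance."""
    n = len(data)
    d = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    # Assumption: population variance (divisor n); n - 1 is an equally common choice.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
        for j in range(d)
    ]
    # Constant features (zero spread) are mapped to 0.0 to avoid dividing by zero.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for row in data
    ]


def components_to_keep(eigenvalues: Sequence[float], inertia_to_keep: float) -> int:
    """Smallest number of leading axes whose cumulative inertia share reaches the target."""
    ordered = sorted(eigenvalues, reverse=True)   # decreasing inertia
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= inertia_to_keep:
            return count
    return len(ordered)


# Made-up eigenvalues: the first two axes cover 3.5 / 4.0 = 87.5% of the inertia.
print(components_to_keep([2.5, 1.0, 0.4, 0.1], 0.85))   # prints: 2
```

Here the inertia percentage of an axis is taken to be its eigenvalue divided by the sum of all eigenvalues, which is the usual reading of the term.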
The following files, related to PCA, are also available for the REAL databases: - ``databasename_PCA.dat.Z'', the projection of the ``CR'' database on its principal components (sorted in a decreasing order of the related inertia percentage), - ``databasename_corr_circle.ps.Z'', a graphical representation of the correlation between the initial attributes and the two first principal components, - ``databasename_proj_PCA.ps.Z'', a graphical representation of the projection of the initial database on the two first principal components, - ``databasename_EV.dat'', a file with the eigenvalues and associated inertia percentages. The Discriminant Factorial Analysis (DFA) can be applied to a learning database where each learning sample belongs to a particular class [Duda73]. The number of discriminant features selected by DFA is fixed as a function of the number of classes (c) and of the number of input dimensions (d); this number is equal to the minimum of d and c-1. In the usual case where d is greater than c, the output dimension is fixed equal to the number of classes minus one and the discriminant axes are selected in order to maximize the between-variance and to minimize the within-variance of the classes. The discrimination power (ratio of the projected between-variance over the projected within-variance) is not the same for each discriminant axis: this ratio decreases for each axis. So for a problem with many classes, this preprocessing will not always be efficient as the last output features will not be so discriminant. This analysis uses the information of the inverse of the global covariance matrix, so the covariance matrix must be well conditioned (for example, a preliminary PCA must be applied to remove the linearly correlated dimensions).
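The two quantities mentioned here, the DFA output dimension and the discrimination power of a projected feature, can be illustrated with a small Python sketch. The function names are hypothetical and the four-point data set is made up; the code does not compute the discriminant axes themselves, only the ratio for a feature that has already been projected.

```python
from typing import Dict, List, Sequence


def dfa_output_dimension(d: int, c: int) -> int:
    """Number of discriminant features: the minimum of d and c - 1."""
    return min(d, c - 1)


def discrimination_power(values: Sequence[float], labels: Sequence[int]) -> float:
    """Between-class over within-class variance of one projected feature."""
    groups: Dict[int, List[float]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        # Both sums omit the common 1/n factor, which cancels in the ratio.
        between += len(members) * (mean - overall_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)
    return between / within   # raises ZeroDivisionError if every class is constant


print(dfa_output_dimension(18, 6))    # prints: 5
print(dfa_output_dimension(18, 11))   # prints: 10
print(discrimination_power([0.0, 1.0, 4.0, 5.0], [0, 0, 1, 1]))   # prints: 16.0
```

The first two calls use d = 18 input dimensions with 6 and 11 classes, the class counts quoted for 'satimage' and 'texture'.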
The DFA preprocessing method has been applied on the first 18 principal components of the 'satimage_PCA' and 'texture_PCA' databases (thus by keeping only the first 18 attributes of these databases before applying the DFA preprocessing) in order to build the 'satimage_DFA.dat.Z' and 'texture_DFA.dat.Z' database files, having respectively 5 and 10 dimensions (the 'satimage' database having 6 classes and 'texture' 11). For each subdirectory, the directoryname is the same as the name chosen for the contained database. Here are the directorynames with a brief numerical description of the available databases. - phoneme French and Spanish phoneme recognition problem. The aim is to distinguish between nasal (AN, IN, ON) and oral (A, I, O, E, E') vowels. 5404 patterns, 5 attributes (the normalized amplitudes of the five first harmonics), 2 classes. This database was in use in the European ESPRIT 5516 project ROARS. The aim of this project is the development and the implementation of a real-time analytical system for French and Spanish phoneme recognition. - texture The aim is to distinguish between 11 different textures (Grass lawn, Pressed calf leather, Handmade paper, Raffia looped to a high pile, Cotton canvas, ...), each pattern (pixel) being characterised by 40 attributes built by the estimation of fourth order modified moments in four orientations: 0, 45, 90 and 135 degrees. 5500 patterns, 11 classes of 500 instances (each class refers to a type of texture in the Brodatz album). The original source of this database is: P. Brodatz "Textures: A Photographic Album for Artists and Designers", Dover Publications, Inc., New York, 1966. This database was generated by the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF Grenoble, France) in the development of the Esprit project ELENA No. 6891 and the Esprit working group ATHOS No. 6620. - satimage (*) Classification of the multi-spectral values of an image of the Landsat satellite.
Each line contains the pixel values in four spectral bands of each of the 9 pixels in a 3x3 neighbourhood and a number indicating the classification label of the central pixel (corresponding to the type of soil: red soil, cotton crop, grey soil, ...). The aim is to predict this classification, given the multi-spectral values. 6435 instances, 36 attributes (4 spectral bands x 9 pixels in neighbourhood), 6 classes. This database was in use in the European StatLog project, which involves comparing the performances of machine learning, statistical, and neural network algorithms on data sets from real-world industrial areas including medicine, finance, image analysis, and engineering design: D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors. Machine learning, Neural and Statistical Classification. Ellis Horwood Series In Artificial Intelligence, England, 1994. - iris (*) This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 4 attributes (sepal length, sepal width, petal length and petal width). (*) These databases are taken from the anonymous ftp "UCI Repository Of Machine Learning Databases and Domain Theories" (ics.uci.edu: pub/machine-learning-databases): Murphy, P. M. and Aha, D. W. (1992). "UCI Repository of machine learning databases" [Machine-readable data repository]. Irvine, CA: University of California, Department of Information and Computer Science. [Duda73] Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
############################################################################## The ELENA PROJECT ~~~~~~~~~~~~~~~~ Neural networks are now known as powerful methods for empirical data analysis, especially for approximation (identification, control, prediction) and classification problems. The ELENA project investigates several aspects of classification by neural networks, including links between neural networks and Bayesian statistical classification, incremental learning (control of the network size by adding or removing neurons),... URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/ELENA.html ELENA is an ESPRIT III Basic Research Action project (No. 6891). It involves: INPG (Grenoble, F), UPC (Barcelona, E), EPFL (Lausanne, CH), UCL (Louvain-la-Neuve, B), Thomson-Sintra ASM (Sophia Antipolis, F), EERIE (Nimes, F). The coordinator of the project can be contacted at: Prof. Christian Jutten, INPG-LTIRF, 46 av. Félix Viallet, F-38031 Grenoble Cedex, France Phone: +33 76 57 45 48, Fax: +33 76 57 47 90, e-mail: chris@tirf.inpg.fr A simulation environment (PACKLIB) has been developed in the project; it is a smart graphical tool allowing fast programming and interactive analysis. The PACKLIB environment greatly simplifies the user's task by requiring only the basic code of the algorithms to be written, while the whole graphical input, output and relationship framework is handled by the environment itself. PACKLIB is used for extensive benchmarks in the ELENA project and in other situations (image processing, control of mobile robots,...). Currently, PACKLIB is tested by beta users and a demo version is available in the public domain. URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/Packlib.html ############################################################################## IF YOU HAVE ANY PROBLEM, QUESTION OR PROPOSITION, PLEASE E_MAIL the following. VOZ Jean-Luc or Michel Verleysen Universite Catholique de Louvain DICE - Lab.
de Microelectronique 3, place du Levant B-1348 LOUVAIN-LA-NEUVE E_mail : voz@dice.ucl.ac.be verleysen@dice.ucl.ac.be ------------------------------------------------------------------------------ *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* Netlib/MDS is a collection of FREE programs for multidimensional scaling and related methods. -- NEW: Four entries covering PREFMAP3, SINDSCAL, and KYST2 -- NEW: Several DOS executable files -- Programs may be obtained by email, ftp, and web browser. *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* Netlib/MDS is a collection of programs having to do with multidimensional scaling and related methods, including PREFMAP, SINDSCAL (INDSCAL), ADDTREE, EXTREE, KYST, MDSCAL, HICLUST, and MDPREF (some in multiple versions). Netlib/MDS is one of many libraries (currently about 140) which are maintained at and distributed by Netlib at several sites around the world. For further information, send email containing only this line send readme from mds to netlib@netlib.bell-labs.com Our thanks to Patrick Groenen (Leiden University, The Netherlands), Phipps Arabie (Rutgers University, USA), and Jacqueline Meulman (Leiden University, The Netherlands) for providing the new programs, and to Joaquin Sanchez (Complutense University, Spain) for other help. <>----------------<>----------------<>----------------<>-----------------<> Joseph B Kruskal, Bell Labs, Lucent Technologies Room 2C-281, Murray Hill, NJ 07974 EMAIL kruskal@research.bell-labs.com PHONE 908-582-3853 FAX 908-582-2379 HOMEPAGE http://cm.bell-labs.com/cm/ms/departments/sia/kruskal/index.html <>----------------<>----------------<>----------------<>-----------------<> ------------------------------------------------------------------------------ Maria Wolters asked: > I'm looking for public domain classification tree induction software. > Our target data is linguistic (letters & part-of-speech tags).
the Other Phylogeny Programs web page at our PHYLIP web site lists 88 packages (yes, there are that many!), many of them freely copyable. It also has a link to the Classification Society's list of freely copyable classification software. The URL is: http://evolution.genetics.washington.edu/phylip/software.html -- Joe Felsenstein joe@genetics.washington.edu (IP No. 128.95.12.41) Dept. of Genetics, Univ. of Washington, Box 357360, Seattle, WA 98195-7360 USA ------------------------------------------------------------------------------ From: "Ted E. Dunning" Subject: Re: Classification tree software for symbolic data ... Maria Wolters wants decision tree software ... look at the following http pages http://www.sgi.com/Technology/mlc/trees.html http://www.cs.jhu.edu/~salzberg/announce-oc1.html ------------------------------------------------------------------------------ Date: Tue, 11 Mar 97 13:44:51 -0800 From: raftery@stat.washington.edu To: mclust@stat.washington.edu Subject: New model-based clustering software and papers Several new pieces of software and papers on model-based clustering are now available over the Web, produced by the MCLUST project at the University of Washington. They can be accessed from http://www.stat.washington.edu/raftery/Research/Mclust/mclust.html (click on "Papers" or "Software"). The new software is: * mclust-em: 2-dimensional model-based clustering with clutter using the EM algorithm * Principal Curve Clustering Software * Nearest Neighbor Cleaning of Spatial Point Processes The new papers are: * Principal Curve Clustering with Noise. Derek Stanford and Adrian E. Raftery. * Non-parametric Maximum Likelihood Estimation of Features in Spatial Point Processes Using Voronoi Tesselation (revised version). Denis Allard and Chris Fraley * Linear Flaw Detection in Woven Textiles using Model-Based Clustering. John G. Campbell, Chris Fraley, Fionn Murtagh and Adrian E. Raftery. * Algorithms for Model-Based Gaussian Hierarchical Clustering. 
Chris Fraley. * Nearest Neighbor Clutter Removal for Estimating Features in Spatial Point Processes. Simon Byers and Adrian E. Raftery ------------------------------------------------------------------------------ Date: Thu, 13 Mar 1997 10:04:44 +1300 From: Murray Jorgensen Yet more model-based clustering software Emboldened by the announcement of the MCLUST project group at the University of Washington, the MULTIMIX group at the University of Waikato (Lynette Hunt and Murray Jorgensen) announce the availability of the MULTIMIX program, which clusters data having both categorical and continuous variables, possibly containing missing observations. The class of models fitted is described in the (Plain) TeX code which follows and generalizes both Latent Class Analysis and Mixtures of Multivariate Normals. We hope soon to have this software available on our ftp site. If you are interested in downloading this software please send us your email address and we will notify you when the program is available. ------------------------------------------------------------------------------ Date: Wed, 12 Mar 1997 13:39:13 -0800 (PST) From: Jan Deleeuw Let me point out once again that for projects like this the Journal of Statistical Software is a nice repository. Statlib is a zoo, without any proper organization. JSS provides peer review, nice formatting, guestbooks for comments, demos when appropriate, code testing by reviewers. Moreover JSS gets hundreds of hits each day. Of course authors maintain copyright, i.e. they can put code in statlib, on their own ftp servers, sell it, whatever, in addition to submitting to JSS. See http://www.stat.ucla.edu/journals/jss/v01/i04/ for a recent clustering example (still partly under construction). ------------------------------------------------------------------------------