60 Replies Latest reply: Mar 1, 2009 12:00 AM by PhilipLeitch

    Cluster Analysis - Agglomerative Nesting (AGNES)

    PhilipLeitch Copper
      AGGLOMERATIVE NESTING (AGNES) USING THE GROUP AVERAGE METHOD OF SOKAL AND MICHENER (1958)

      As adapted from "L. Kaufman and P.J. Rousseeuw (1990), FINDING GROUPS IN DATA: AN INTRODUCTION TO CLUSTER ANALYSIS, New York: John Wiley." Chapter 5.

      For those who believe all statistics are lies - rest assured there is no significance testing here - the decision about what is a separate group and what is not is purely up to the researcher's own discretion.

      Philip
      ___________________
      Correct answers don't require correct spelling.
        • Cluster Analysis - Agglomerative Nesting (AGNES)
          ptc-1368288 Copper
          >No matter how good 11 was, it is outdated and superseded many times over <.
          ______________________________

          How many times you have read from the "Mathcad Power Users" that what worked in 11 does not any more in 13, 14 ? You must have read like 100's of times and how many times it was reported to me directly that 13 fails reading 11.

          No more argument, if you know how to make it work in 11 and save 2001i, even 2000, you will have extra possible contribution. Only reading will save me precious time.

          jmG
            • Cluster Analysis - Agglomerative Nesting (AGNES)
              PhilipLeitch Copper
              >How many times you have read
              >from the "Mathcad Power Users"
              >that what worked in 11 does
              >not any more in 13, 14 ?

              500 - no 502.... or was it 501. No - it's equal to how many roads a man must go down before he's considered a man divided by the length of a piece of string.

              he he he Seriously though - you do have a valid point about compatibility.

              I wish I had ver 11 to use some of the great worksheets - especially in the area of distributions. However, lack of true backwards compatibility isn't a reason to stop all progress.

              I have some great 16 bit apps that don't work on my 32 bit system, and some 32 bit apps that don't work on my 64 bit system. Even emulation (DOS Box etc.) doesn't help in all cases.

              This isn't stopping me upgrading. Lack of backwards compatibility IS an issue, but that shouldn't stop you from getting upgrades.

              Yes - I stayed with XP because I found out many of my peripherals don't work on Vista (also found it too slow). But I'm still planning on upgrading to the next Windows version because that's progress.

              >No more argument, if you know
              >how to make it work in 11 and
              >save 2001i, even 2000, you
              >will have extra possible
              >contribution. Only reading
              >will save me precious time.

              Agreed.

              I don't know how to make it work on 11. If there are simple rules I could take to make it work on 11... I probably still wouldn't bother because I don't have 11 (or earlier). I write sheets for myself. I post them here so that perhaps others find them useful. BUT to save people's time I will note what version(s) each has been developed and tested in.

              I still wish I could get earlier versions of Mathcad....

              Cheers,
              Philip
              ___________________
              Correct answers don't require correct spelling.
                • Cluster Analysis - Agglomerative Nesting (AGNES)
                  A.Non PTC Community Champion
                  On 2/19/2009 12:30:44 AM, pleitch wrote:

                  >I don't know how to make it
                  >work on 11.

                  I fixed it so it works in version 11 and version 14. You had some statements in programs with an undefined variable on the rhs. Surprisingly, they worked in versions 13 and 14 because the variable was also on the lhs. Also, rows of an empty string throws an error in version 11, so I changed it to rows(0).

                  I also added a few more distance metrics.

                  AND....

                  I figured that since you had done the work of writing the hierarchical clustering algorithm, which is something I have wanted in Mathcad but been too lazy/busy to write, I would write the other essential part: a dendrogram drawing algorithm. It makes it a heck of a lot easier to see the grouping!

                  Lastly, as I suspected, your calculation for AC is wrong. This is how it's supposed to be calculated:

                  http://www.unesco.org/webworld/idams/advguide/Chapt7_1_4.htm

                  You can check it here:

                  http://www.wessa.net/rwasp_agglomerativehierarchicalclustering.wasp

                  I ran the snake data through the web program and your sheet and the dendrogram looks fine: all the distances match. The web value of AC is about 0.9 though, which is a lot more reasonable.
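For anyone following along without the links, the agglomerative coefficient as I understand the Kaufman and Rousseeuw definition can be sketched in a few lines of Python. For each object, take the dissimilarity at which it is first merged into a cluster, divide by the dissimilarity of the final merge, and average one minus that ratio over all objects. The function name and the example heights are illustrative assumptions; they are not taken from the worksheet or from the snake data.

```python
def agglomerative_coefficient(first_merge_heights, final_merge_height):
    """AC = mean over objects of 1 - (first merge height / final merge height)."""
    if final_merge_height <= 0:
        raise ValueError("final merge height must be positive")
    ratios = [h / final_merge_height for h in first_merge_heights]
    return sum(1.0 - r for r in ratios) / len(ratios)


# Hypothetical example: five objects, each listed with the dissimilarity
# at which it first joined a cluster; the final merge happened at 10.0.
heights = [1.0, 1.0, 2.0, 2.0, 4.0]
ac = agglomerative_coefficient(heights, 10.0)
print(round(ac, 3))
```

Values near 1 mean most objects joined a cluster at a small fraction of the final merge distance (strong structure); values near 0 mean they joined late.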

                  Richard
              • Cluster Analysis - Agglomerative Nesting (AGNES)
                PhilipLeitch Copper
                I disagree completely. It works on 13 - therefore no problem.


                >What's wrong is the 13
                >style that could probably be
                >done 11 style.
                Or from my perspective, what's wrong is that you aren't running this in the most current versions of the product. I'm not surprised that it doesn't work when used on a product that's ??5?? years out of date. Since it works on mine and everyone else's purchased within the last 4-5 years, then there is nothing actually "wrong".

                >I understand
                >that if you never had enjoyed
                >11 or earlier version you
                >can't figure how more logical
                >the programming style was.
                True, and also amusing. Using the same logic: I don't know the benefits that punchcard programming offers.

                Old technology that had its merits, and frequently has superior aspects, but technology has moved on. No matter how good 11 was, it is outdated and superseded many times over.

                However - I get what you mean, I LOVED version 2.5 (or thereabouts), and if I could find my old disks or track down another copy I would love to install it.



                >Interesting but does not go
                >far enough.
                It has gone around the world - probably 10's of thousands of km... how far does it need to go???

                Seriously though, I'm not sure what you mean by "does not go far enough". It does exactly what I want it to - no more and no less.


                Could it have been programmed more elegantly? Yes.

                Could it have been programmed in a way that it worked on outdated versions?
                Yes.

                Could I have "taken it further"?
                Yes

                Do these arguments hold any weight for me whatsoever?
                No. Not that I can see anyway.... I made it the way I wanted it and to take it any further would have wasted my time.



                The Mahalanobis post was great - especially the PCA - I was half way through coding my own PCA and had got stuck. The mathematics notation in the book I have is beyond me - seeing it in Mathcad is great. Thank you very much for that post.

                Where did the sheet come from?

                Ta.

                Philip
                ___________________
                Correct answers don't require correct spelling.
                • Cluster Analysis - Agglomerative Nesting (AGNES)
                  PhilipLeitch Copper
                  What's interesting is that I get issues when trying to find the "most optimal" answer from your k-mean version.

                  Two different messages from two different MCad Versions.

                  Philip
                  ___________________
                  Correct answers don't require correct spelling.
                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                    PhilipLeitch Copper
                    Thanks - interesting.

                    The thread was a little off with what it said. There are two types of cluster analysis, but they are Partitioning and Hierarchical. Within each there are multiple approaches. The approach I posted is Hierarchical, the k-mean is one of many types of Partitioning.

                    In fact, the k-mean is very common - maybe the most popular of the partitioning methods. However, the authors of the book I'm going off suggest that the k-medoid method is superior to the k-mean approach.

                    Why? The k-mean results depend on the order of objects in the input (as I believe you noted in your worksheet). Also - k-mean, along with other variance minimisation methods, is strongly affected by outliers.

                    The authors state:
                    "the k-medoid method is appealing because it is more robust than the error sum of squares employed in most methods. (The maximal dissimilarity used in the k-center method is even less robust.) Furthermore it allows good characterization of all clusters that are not too elongated and makes it possible to isolate outliers in most situations."

                    I will plug your data into my worksheet and post the results.
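The outlier point can be seen in a tiny Python sketch. It compares the mean of a one-dimensional cluster with its medoid (the member with the smallest total distance to all other members) before and after a single outlier is added. The numbers are made up for illustration, and this only shows the behaviour of the two "centre" definitions, not a full k-mean or k-medoid implementation.

```python
def mean(points):
    """Arithmetic mean, the centre used by k-mean style methods."""
    return sum(points) / len(points)


def medoid(points):
    """The data point with the smallest total distance to all the others."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Hypothetical one-dimensional cluster, then the same cluster plus one outlier.
cluster = [1.0, 2.0, 3.0, 4.0]
with_outlier = cluster + [100.0]

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

The mean is dragged most of the way toward the outlier, while the medoid only moves to a neighbouring member of the original cluster.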

                    Philip
                    ___________________
                    Correct answers don't require correct spelling.
                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                        PhilipLeitch Copper
                        It looks like there should be three groups.

                        When you look at the original plot you "think" you can see the three groups - but these aren't the three groups that play out on analysis. The reason is that each data dimension is in the same units of measurement, therefore the distance from point m to point o is actually fairly big in Euclidean space only because of one dimension (the first one). The other dimensions are proportionally smaller and therefore have proportionally less contribution to the distance. So to properly see the data/groups - it is best to observe with the same scale on every axis (I show both auto scaled and manual 0-300 scale).

                        Philip
                        ___________________
                        Correct answers don't require correct spelling.
                          • Cluster Analysis - Agglomerative Nesting (AGNES)
                            ptc-1368288 Copper
                            In the program COMPARISON,

                            "Results" is red as not included in the LHS function, presumably one the dist(,,,) above. In Mathcad 11 it would have to be assigned a matrix variable name... next is rows(combined_ED) as index: that does not work either in 11. Too much work to convert to 11. What's wrong is the 13 style that could probably be done 11 style. I understand that if you never had enjoyed 11 or earlier version you can't figure how more logical the programming style was.

                            Interesting but does not go far enough.

                            jmG
                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                              A.Non PTC Community Champion
                              On 2/18/2009 8:57:09 PM, pleitch wrote:
                              >It looks like there should be
                              >three groups.

                              I would argue there are four groups. You need to be very careful with data scaling.


                              The Story of the Court Mathematician

                              Once upon a time, in a land far, far away, there lived a King. The King had a very pretty daughter, but he also had a problem. His daughter was very fond of the numerous snakes that could be found in the palace grounds (she was pretty, but a little strange). Some of these snakes were poisonous and some were not, but nobody knew how to tell them apart. This worried the King greatly, because he did not want his daughter to be bitten by a poisonous snake before he could demand a huge dowry for her hand in marriage from the King of the much larger kingdom to the west. He knew he could not just get rid of all the snakes, because this would upset his beloved daughter, so he called in his most learned court scholars. When presented with the King's dilemma, the Court Mathematician promptly announced "there is a new method called 'cluster analysis' that I think may elucidate the problem". The King, not being nearly as learned as the wise mathematician, replied "I didn't know you could elucidate a snake, or why that would protect my daughter, but if you think it would help then your suggestion has my full support". The mathematician was a little perplexed by the King's answer, but was wise enough to know you did not question a king. The next day he set about making some measurements of the dead snakes he found in the palace grounds (being very wise, he realized that dead snakes, poisonous or otherwise, couldn't bite him). He measured the length and diameter of each snake he found, as well as the length of the fangs. When he had collected enough data, he plotted the three measurements on a graph. He made sure to use the same scale for each axis, because he didn't want to favor one measurement over another. This is what he saw:



                              There were clearly two species of snake! The only remaining problem was to determine which species was poisonous. Being very wise, he realized that although he could only do this using a live snake, along with a disposable prisoner from the King's dungeon, he did not need to take unnecessary risks by catching more than one. It did not take long for the Court Mathematician to catch one of the larger snakes and determine that, unfortunately for the prisoner, it was poisonous. The next day the Court Mathematician took his findings to the King, who was immensely pleased. The King immediately ordered that all the larger snakes be captured and released over the border of the much smaller kingdom to the east (he did not like the King of the kingdom to the east, because many years before he had demanded a huge dowry for the hand in marriage of his very pretty daughter).

                              Time passed happily, until one day the King's daughter was bitten by a snake and died. The King was furious, and summoned the Court Mathematician. "You told me that only the large snakes were poisonous, and now my daughter is dead. As a punishment that you will never forget you will be elucidated! Take him away!"

                              Eventually the ex-Court Mathematician recovered enough from his punishment to investigate where he had gone wrong. After much study he solved the problem by inventing two new techniques for data analysis, which he presented in a very high-pitched voice at the next inter-kingdom symposium on applied mathematics. He called these new techniques "mean centering" and "variance scaling". When he applied these new techniques to his snake data, this is what he saw:



                              There were three species of snake! One of the two smaller species was also poisonous! The other mathematicians were so impressed they gave him a major award with a nice engraved plaque he could hang on his wall. Alas, he could never have a son to inherit the plaque and be proud of his father's achievements.

                              There are two morals to this story:
                              1) If you want to continue to speak in a normal voice, and perhaps have children, do not anger kings.
                              2) If you do not want to anger kings, scale your data correctly prior to analysis.
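For readers who would rather see moral 2 as code than as a fairy story, here is a minimal Python sketch of mean centering and variance scaling. Each column has its mean subtracted and is then divided by its sample standard deviation. The snake measurements below are invented for illustration (they are not the data from the story's plots), and the function names are my own.

```python
import math


def autoscale(rows):
    """Mean-center each column, then divide it by its sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    stdevs = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
        for c, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical snakes: (length, diameter, fang length), arbitrary units.
snakes = [
    [100.0, 2.0, 0.2],
    [110.0, 2.1, 1.0],
    [200.0, 4.0, 0.2],
    [210.0, 4.1, 1.0],
]
scaled = autoscale(snakes)

# Snakes 0 and 1 differ mainly in fang length; snakes 0 and 2 differ in size.
print(euclidean(snakes[0], snakes[1]), euclidean(snakes[0], snakes[2]))
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw values the fang difference is swamped by the length difference; after scaling, the two distances are of comparable size, so a fang-based grouping can show up in the clustering.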

                              Richard
                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                              A.Non PTC Community Champion
                              On 2/18/2009 7:52:41 PM, pleitch wrote:

                              >The thread was a little off
                              >with what it said. There are
                              >two types of cluster analysis,
                              >but they are Partitioning and
                              >Hierarchical.

                              Thanks for the clarification.

                              >Within each
                              >there are multiple approaches.
                              >The approach I posted is
                              >Hierarchical

                              That helps. I haven't had time yet to really look at your worksheet in detail (too busy writing fairy stories!), but it is much easier to figure out what's going on when you know that. I would suggest even adding a comment to that effect at the top of the sheet somewhere.

                              Richard
                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                              ptc-1368288 Copper
                              BTW, there was a project similar to what
                              you might have done, can't retrieve.

                              jmG
                                • Cluster Analysis - Agglomerative Nesting (AGNES)
                                  PhilipLeitch Copper
                                  Thanks Jean.

                                  I realise there are projects like I have done, but I am coding myself for two reasons:

                                  1) I learn by doing - not by reading (I hear I forget, I see I learn, I do I understand). I "saw" the concept in the book and learnt it, but understanding the implementation meant coding it myself.

                                  2) Most of the worksheets I produce are for research or development. By research I mean mining company data (lots of it). By development I mean determining a process/procedure in Mathcad then replicating in source code (normally C# or T-SQL).
                                  Therefore, I don't just need to replicate a process that works, I need to actually do a bit of optimisation.

                                  For instance, in the method for determining the distance between groups, I'm still trying to think of a way of making it even faster. Cartesian joins are not good so should be avoided if possible. My rough idea is that if group A is joining to group B, and both are further away from group C than from any other group - there should be no need to compute the distance from Group AB to C. In fact, unless Group A or Group B is "relatively close" to group C compared to other groups, there is no need to compute the distance. However, one iteration later there may be a need for the calculation of AB to C. I'm assuming that some type of centroid could be recorded, and only groups in a relatively close centroid range should have their distances calculated - if they aren't calculated already. Also - another idea is that if the distance from A to C is known, as is B to C and the number of members of A, B and C are known - surely there is a simpler way of determining the distance from AB to C.
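On that last idea: for the group average method there is such a shortcut, usually called the Lance-Williams update. The distance from the merged group AB to C is the size-weighted average of the two stored distances, d(AB, C) = (nA * d(A, C) + nB * d(B, C)) / (nA + nB), and the size of C is not needed. The Python sketch below checks the shortcut against the brute-force pairwise calculation. The points and function names are hypothetical, and it assumes group average linkage only; other linkage methods use different update formulas.

```python
import math
from itertools import product


def average_linkage(group_a, group_b):
    """Group average distance: mean of all pairwise point distances (brute force)."""
    pairs = list(product(group_a, group_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def updated_distance(d_ac, d_bc, n_a, n_b):
    """Distance from the merged group AB to C, using only stored values."""
    return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)


# Hypothetical groups of 2-D points.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 2.0)]
C = [(5.0, 5.0), (6.0, 5.0), (5.0, 7.0)]

brute = average_linkage(A + B, C)
shortcut = updated_distance(
    average_linkage(A, C), average_linkage(B, C), len(A), len(B)
)
print(brute, shortcut)
```

With this update, each merge only needs one pass over the existing distance table rather than a fresh pairwise join between the points of AB and every other group.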

                                  .. so I'm still getting my head around this.

                                  Philip
                                  ___________________
                                  Correct answers don't require correct spelling.
                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                      A.Non PTC Community Champion
                                      On 2/18/2009 6:17:20 PM, pleitch wrote:
                                      >Thanks Jean.
                                      >
                                      >I realise there are projects
                                      >like I have done

                                      This might be what Jean was referring to:

                                      http://collab.mathsoft.com/read?27146,17e#27146

                                      As far as my final sentence of the thread goes, it's still on my to-do list :-)

                                      Now I can also add to my to-do list a comparison of your algorithm and k-means. At the current rate I should have that done within the next decade or so :-)

                                      Richard
                                        • Cluster Analysis - Agglomerative Nesting (AGNES)
                                          PhilipLeitch Copper
                                          From the University of Wikipedia:

                                          Relation between PCA and K-means clustering
                                          It has been shown recently (2007) that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.



                                          Philip
                                          ___________________
                                          Correct answers don't require correct spelling.
                                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                                              PhilipLeitch Copper
                                              I've looked at the data from the k-mean approach and I still don't see 4 groups.

                                              However - it does draw a very good point. Are these the principal components associated with the data?

                                              Using the snake example, you wouldn't take random measurements of snakes and then assume that they will be related to how venomous they are.

                                              Instead, you would attempt to determine which variables are principal in determining the venom attribute (or degree of venomousness). This may well be done after measurements are taken.

                                              If there is no easy measure of venom - other than killing something/someone, then the principal components would logically be the ones that are most useful in distinguishing types of snakes.

                                              I have now attempted to make several views of the data, including a mean divided by stdev. Even then I don't see four groups.

                                              With this data I would happily agree that when viewed from 2 of the three data axes, there do indeed look to be 4 groups. But under the third these groups dissipate. The final group (the most extreme one) is so unclustered that if it is to be considered a group, so must the first group (the small one that arguably could be considered two groups).

                                              However - even with the data transformation, the AC is so low as to assume that there is no clustering/grouping occurring at all.


                                              Philip
                                              ___________________
                                              Correct answers don't require correct spelling.
                                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                                        PhilipLeitch Copper
                                        Thanks. I don't have 11 so I can't test these things. I have 13 and 14 (from what I can see they aren't very different from each other except for symbolic engine).

                                        About all I can do is save back to version 11 and cross my fingers.

                                        Philip
                                        ___________________
                                        Correct answers don't require correct spelling.
                                        • Cluster Analysis - Agglomerative Nesting (AGNES)
                                          PhilipOakley Gold
                                          In V11.2a there is a bug on "Results" in the big bit of code. It is in the Test for Results==0 as Results is at that point undefined.

                                          It is big code ;-)

                                          Philip Oakley
                                          • Cluster Analysis - Agglomerative Nesting (AGNES)
                                            PhilipLeitch Copper
                                            Ha. I loved it. VERY interesting. "The Story of the Herpetologist Princess"

                                            If you look at my work sheet you will actually find that the AC was very low, indicating that there wasn't much in the way of clustering at all, and it could be (should be??) argued that a value that low would mean that no groups exist at all under that approach (i.e. based on the measurements - they are all just "snakes" of different sizes).

                                            My identification of 3 was based on the distance between groupings staying relatively low until the last 2 combinations - indicating that the last two may not be appropriate. However, I would argue that even these 3 groups are so loosely clustered as not to be groups at all. In fact, I would argue that with the values you have - there is insufficient clustering to show distinct groups. Three is a small group indeed!

                                            Even with scaling the data looks linear rather than clustered, lending it to Linear Discriminant Analysis (LDA) as the more appropriate tool.

                                            Also - just because I haven't scaled data, that doesn't mean that the data can't be scaled appropriately before inclusion into my work sheet. I can scale the data as appropriate. A PCA or a Factor analysis, could be used to find a scaling of the dimensions, or a Factor Analysis could move the data in Euclidean space to a point that maximises the effect of variables equally, which would then be the appropriate transformations for the data to scale (I haven't written the factor analysis stuff yet).

                                            This may be why/how the PCA has been found appropriate to maximise the efficiency of the k-mean approach?? I haven't read into it yet.


                                            However, it all comes back to "what's the question" - and "what's the data". I didn't know either, so had to assume that they were similar measurements in the same units.

                                            For all I knew, they were measurements of height, breadth and depth.

                                            I would still argue that on visual inspection the groups identified by either process are invalid and other processes should be used.


                                            So - when all is said and done what should the mathematician have done?

                                            He should have gone to a biologist (like me) and asked what should be done. The biologist would have told the princess "DO NOT EAT ANY OF THE SNAKES".

                                            A biologist would have known that only "venomous" snakes can kill you by a bite, "poisonous" snakes have to be eaten to kill you. Almost all snake venoms can be eaten safely (alkaloid toxins are neutralised in the stomach), venoms are only toxic when taken subcutaneously (under the skin). Poisons are always toxic when consumed. A substance can be both venomous and poisonous at the same time BUT the terms are NOT interchangeable, so such a substance would be termed venomous if introduced by injection.

                                            So if the princess had just played with the snake instead of eating it - there would not have been a problem. The bite was just a distraction, or maybe what caused her to bite the snake back resulting in her death by poisoning???

                                            Assuming that this is back in the days of alchemy when they didn't make the distinction.... or that the King and Mathematician were not learned in the ways of nature... we can assume that they meant venomous.

                                            They should have still asked a biologist.

                                            Then the biologist would at least have known WHAT to observe - and would have noted that venomous snakes only have fangs at the front of the mouth rather than the non-venomous back of the mouth snakes (distinctly different morphologies). The fangs themselves are different in shape for the introduction of the venom through the fangs.

                                            All snakes could have been collected - their mouth/fangs inspected and venomous ones given to the kindly king from the smaller kingdom.


                                            Still - without over-thinking the issue - a nice little problem.

                                            Philip
                                            ___________________
                                            Correct answers don't require correct spelling.
                                              • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                A.Non PTC Community Champion
                                                On 2/19/2009 7:00:03 PM, pleitch wrote:

                                                >If you look at my work sheet
                                                >you will actually find that
                                                >the AC was very low,
                                                >indicating that there wasn't
                                                >much in the way of clustering
                                                >at all

                                                See my other post for what I think about the AC metric.



                                                >A PCA or a Factor analysis,
                                                >could be used to find a
                                                >scaling of the dimensions,

                                                I don't see how you can find the scaling using PCA. In fact, PCA is very dependent upon the scaling of the data.
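A quick way to see the dependence is to rescale one variable and watch the first principal axis swing around. The Python sketch below does this for two variables, using the closed-form angle of the major axis of a 2x2 covariance matrix, 0.5 * atan2(2*sxy, sxx - syy). The data values are invented and the function name is my own; this is a two-variable illustration only, not a general PCA routine.

```python
import math


def first_pc_angle(xs, ys):
    """Angle (degrees from the x axis) of the first principal axis of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))


# Hypothetical measurements with equal variance in both variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

angle_raw = first_pc_angle(xs, ys)
# Same data, but the second variable expressed in units 100 times smaller.
angle_rescaled = first_pc_angle(xs, [100.0 * y for y in ys])
print(angle_raw, angle_rescaled)
```

With equal variances the first axis sits at 45 degrees; after the change of units it lies almost entirely along the rescaled variable, so the "principal" direction is as much a statement about units as about the data.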

                                                > or
                                                >a Factor Analysis could move
                                                >the data in Euclidean space to
                                                >a point that maximises the
                                                >effect of variables equally,
                                                >which would then be the
                                                >appropriate transformations
                                                >for the data to scale (I
                                                >haven't written the factor
                                                >analysis stuff yet).

                                                I have :-) PCA, anyway. There are more types of factor analysis than I care to think about, let alone write in Mathcad. I wrote the PCA stuff for myself though, so I would have to do some work on it before I was prepared to post it. PCA is implemented in the Data Analysis Extension Pack though, so if you get that you can save yourself some effort.

                                                >This may be why/how the PCA
                                                >has been found appropriate to
                                                >maximise the efficiency of the
                                                >k-mean approach?? I haven't
                                                >read into it yet.

                                                No. PCA just lets you reduce the number of variables, assuming the variables in the raw data are collinear. That can make it much easier for the k-means clustering to find the clusters.


                                                >So - when all is said and done
                                                >what should the mathematician
                                                >have done?
                                                >
                                                >He should have gone to a
                                                >biologist (like me) and asked
                                                >what should be done.

                                                He couldn't. The Court Biologist was the Princess. I just forgot to mention that point ;-)

                                                >Assuming that this is back in
                                                >the days of alchemy when they
                                                >didn't make the
                                                >distinction....

                                                They didn't. Except perhaps the princess, but of course she wasn't asked.

                                                >They should have still asked a
                                                >biologist.

                                                I think the more important point is that if the King wanted a practical solution to a real world problem, he shouldn't have asked a mathematician :-)

                                                Richard
                                              • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                A.Non PTC Community Champion
                                                On 2/20/2009 8:00:47 AM, pleitch wrote:
                                                >I've looked at the data from
                                                >the k-mean approach and I
                                                >still don't see 4 groups.

                                                See below


                                                >However - it does draw a very
                                                >good point. Are these the
                                                >principal components associated
                                                >with the data?

                                                I have no idea, but I doubt it. It's just the example data that was with the Cluster 3.0 software, and nothing whatsoever is known about it. For all I know it's just made up.

                                                >Using the snake example, you
                                                >wouldn't take random
                                                >measurements of snakes and
                                                >then assume that they will be
                                                >related to how venomous they
                                                >are.

                                                Well, perhaps not in the snakes example, but there are plenty of examples where that is what you do. You measure what you can, and then try to correlate that with the known property.

                                                >Instead, you would attempt to
                                                >determine which variables are
                                                >principal in determining the
                                                >venom attribute (or degree of
                                                >venomousness). This may well
                                                >be done after measurements are
                                                >taken.

                                                Exactly. So you measure everything you can, then correlate afterward. We in the spectroscopy world do it all the time.

                                                >If there is no easy measure of
                                                >venom - other than killing
                                                >something/someone, then the
                                                >principal components would
                                                >logically be the ones that
                                                >are most useful in
                                                >distinguishing types of
                                                >snakes.

                                                No, not necessarily. Principal components are based solely on variance in the data. That variance may not correlate with the property you wish to measure though. As an example, take 2 species of snake. They both have about the same length, and that length is highly variable. That's one variable you measure. Now let's say you measure 10 other variables with high variance but little or no discriminatory power (diameter, etc). Finally, you measure color. The snakes have very similar colors, but the within-snake color variation is very small, so you can tell them apart using this. You now have 11 variables, 10 of them with high variance that tell you nothing about the snake species, and one with very low variance that does. The PCs will be dominated by the high variance variables, not the color, and will not help solve the problem. In fact it will make it much worse, because it will take the one useful variable you did have and mix it up in linear combinations with all the others.
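The snake argument above can be checked numerically. The following is only a sketch under stated assumptions: the measurements are invented, three variables stand in for the eleven, and the first principal component is found by plain power iteration on the covariance matrix rather than by any particular PCA package.

```python
# Hypothetical measurements per snake: (length_cm, diameter_cm, color_index).
# Length and diameter vary a lot within each species; color barely varies
# within a species but differs between the two.
species_a = [(60, 3.0, 0.39), (80, 5.0, 0.40), (100, 2.0, 0.41),
             (120, 6.0, 0.40), (140, 4.0, 0.40)]
species_b = [(60, 4.0, 0.60), (80, 2.0, 0.61), (100, 6.0, 0.59),
             (120, 3.0, 0.60), (140, 5.0, 0.60)]
data = species_a + species_b

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_pc(cov, iterations=200):
    """Dominant eigenvector of `cov`, found by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

pc1 = first_pc(covariance_matrix(data))
print("PC1 loadings (length, diameter, color):", [round(x, 4) for x in pc1])

# Color alone separates the species perfectly in this toy data.
separable_by_color = max(r[2] for r in species_a) < min(r[2] for r in species_b)
print("separable by color alone:", separable_by_color)
```

In this toy data the first component is almost entirely length, and the loading on color is close to zero, even though color is the only variable that tells the species apart.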

                                                >I have now attempted to make
                                                >several views of the data,
                                                >including a mean divided by
                                                >stdev. Even then I don't see
                                                >four groups.
                                                >
                                                >With this data I would happily
                                                >agree that when viewed from 2
                                                >of the three data axes, there
                                                >does indeed look like 4
                                                >groups. But under the third
                                                >these groups dissipate.

                                                You can't look at only 2 variables at a time. It's a 3 dimensional data set.

                                                > The
                                                >final group (the most extreme
                                                >one) is so unclustered that if
                                                >it is to be considered a
                                                >group, so must the first group
                                                >(the small one that arguably
                                                >could be considered two
                                                >groups).

                                                That's the one :-)

                                                However, I agree it's completely subjective, and since we know nothing about the data it could be any number of groups. Maybe it's just one badly sampled continuous distribution!


                                                >However - even with the data
                                                >transformation, the AC is so
                                                >low as to assume that there is
                                                >no clustering/grouping
                                                >occurring at all.

                                                I am wondering if you have the formula for AC correct. If you have, then I would consider it a rather useless metric, because even for the snake data it's only 0.19. The grouping in the snake data is obvious (for 2 groups, anyway).

                                                Richard
                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                    ptc-1368288 Copper
                                                    You are right,

                                                    The King's daughter preferred baby snakes, shorter on the meter stick. Do you mind if I pass that lovely story to my best friend in statistics as well ?

                                                    Jean
                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                      PhilipOakley Gold
                                                      On 2/20/2009 10:01:05 AM, rijackson wrote:
                                                      >
                                                      >No, not necessarily. Principal
                                                      >components are based solely on variance
                                                      >in the data. That variance may not
                                                      >correlate with the property you wish to
                                                      >measure though. As an example, take 2
                                                      >species of snake. They both have about
                                                      >the same length, and that length is
                                                      >highly variable. That's one variable you
                                                      >measure. Now let's say you measure 10
                                                      >other variables with high variance but
                                                      >little or no discriminatory power
                                                      >(diameter, etc). Finally, you measure
                                                      >color. The snakes have very similar
                                                      >colors, but the within-snake color
                                                      >variation is very small, so you can tell
                                                      >them apart using this. You now have 11
                                                      >variables, 10 of them with high variance
                                                      >that tell you nothing about the snake
                                                      >species, and one with very low variance
                                                      >that does. The PCs will be dominated by
                                                      >the high variance variables, not the
                                                      >color, and will not help solve the
                                                      >problem. In fact it will make it much
                                                      >worse, because it will take the one
                                                      >useful variable you did have and mix it
                                                      >up in linear combinations with all the
                                                      >others.
                                                      >
                                                      >Richard

                                                      The techniques around PCA, such as SVD, etc., will identify the separate groups. In a multidimensional case one has to be cautious about misunderstandings (e.g. the separation axis may not be one of the dimensions). [This is noted for the general reader, rather than Richard who I believe already appreciates this].

                                                      We need to be careful in the explanations about where the particular distinctions are in each approach and how, often, they are different ends of the same calculation. The PCA and SVD methods normally sort the components by various measures of size. Sometimes we want to start at the big end and sometimes at the small end, depending on what we want to achieve.

                                                      It is an optimisation problem. We are trying to optimise the separation between putative groups based on various flexible criteria...

                                                      Philip Oakley
                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                      A.Non PTC Community Champion
                                                      On 2/20/2009 10:40:12 AM, jmG wrote:
                                                      > Do you mind if I
                                                      >pass that lovely story to my
                                                      >best friend in statistics as
                                                      >well ?

                                                      You can pass it on to whoever you wish. Once I post something here I figure it's in the public domain anyway.

                                                      Richard


                                                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                        A.Non PTC Community Champion
                                                        On 2/20/2009 9:49:32 PM, pleitch wrote:
                                                        >Thank you Richard
                                                        >
                                                        >"I am wondering if you have
                                                        >the formula for AC correct.
                                                        >
                                                        >It matched the book that I got
                                                        >it from but that DOES NOT mean
                                                        >that I have it correct.

                                                        "R" is the programming language designed and used by statisticians. I would be truly amazed if they had it wrong.

                                                        >Anyway - my point about PCA -
                                                        >and more specifically factor
                                                        >analysis (next couple of
                                                        >projects I'll be doing), is
                                                        >that you can move the factor's
                                                        >effects in Euclidean space to
                                                        >maximise the clustering..... I
                                                        >think.

                                                        With or without a-priori information about the data? With a-priori information about the data that's certainly true. Without it, I'm not so sure. You can extend PCA though to give much better classification. In my field a very successful algorithm is SIMCA:

                                                        http://www.camo.com/resources/simca.html

                                                        Beware though! I believe it is patented.

                                                        Richard
                                                          • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                            PhilipLeitch Copper
                                                            yech - patents are terrible things. They lock up useful things for no good reason. They are the greatest blocking factor in Information Technology. You can barely turn on your computer without breaking a patent.

                                                            For a while Puma Tech had a patent covering any type of data synchronisation based on a time or ID Token system (i.e. identity change number).

                                                            Ummm... try and make data sync without any type of time or ID tag. That was neither innovative nor really an "invention". It just made Puma Tech lots of money, lawyers lots of money and made software more expensive. What a massive loss of productivity!!!

                                                            My background is Applied Biology and Environmental Science (double major BSc) - so I did statistics (a lot of statistics), but never delved into clustering, PCA/Factor analysis. I've more recently completed an MBA - but that was devoid of statistics.

                                                            I've done two honours (year of research), for both the BSc and MBA and both were statistics based. But again, neither was based on this area which is why I am teaching myself. Same as Bayesian statistics - I've dabbled enough to know I don't know enough so I'm about to order some books on that.

                                                            So I'm not speaking from a position of authority on these areas. Please don't hold back telling me when I'm wrong, because I don't know better but want to know. I have already picked up many pieces from this thread. Thank you for the info so far.

                                                            R - is that a free software? Do you have some links for it? I have seen some books on it but I've never heard about it. I didn't know it was a programming language until your post just now.

                                                            Philip
                                                            ___________________
                                                            Correct answers don't require correct spelling.
                                                              • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                A.Non PTC Community Champion
                                                                On 2/21/2009 3:22:52 AM, pleitch wrote:
                                                                >yech - patents are terrible
                                                                >things.

                                                                Only when they are other people's :-)

                                                                >My background is Applied
                                                                >Biology and Environmental
                                                                >Science (double major BSc) -
                                                                >so I did statistics (a lot of
                                                                >statistics), but never delved
                                                                >into clustering, PCA/Factor
                                                                >analysis. I've more recently
                                                                >completed an MBA - but that
                                                                >was devoid of statistics.
                                                                >
                                                                >I've done two honours (year of
                                                                >research), for both the BSc
                                                                >and MBA and both were
                                                                >statistics based. But again,
                                                                >neither was based on this area
                                                                >which is why I am teaching
                                                                >myself. Same as Bayesian
                                                                >statistics - I've dabbled
                                                                >enough to know I don't know
                                                                >enough so I'm about to order
                                                                >some books on that.

                                                                Well, my only formal training in statistics was when I was force fed a dose of it during my physics degree. That did not cover anything to do with multivariate analysis, cluster analysis, etc. I have learned it from books, by listening to a lot of talks at conferences, and by getting a lot of advice from colleagues who know more about it than I do. In my field people usually avoid the term "statistics", and use the term "chemometrics" instead. I believe the term was coined because the statistics based algorithms are often applied to data and/or applied in a way that violates the statistical assumptions. So the results are not really statistical, but they work and are exceedingly useful.

                                                                >R - is that a free software?
                                                                >Do you have some links for it?
                                                                >I have seen some books on it
                                                                >but I've never heard about it.
                                                                >I didn't know it was a
                                                                >programming language until
                                                                >your post just now.

                                                                I don't know much about it myself. I had someone recommend it to me about a year ago. He seemed to think it was the best thing since sliced bread (which, as it happens, I hate), but then he was a statistician. I think it's sort of a statistical Matlab. You can download it here:

                                                                http://www.r-project.org/

                                                                Richard
                                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                    PhilipLeitch Copper
                                                                    I've now been over this a number of times.

                                                                    The book I use has the AC based on the banner.

                                                                    The length of individual bands indicates the dissimilarity. The banner's length is the distance divided by the largest distance (last join).

                                                                    Therefore, this is a distance measured FROM 1. A distance value on the banner of 0.2 would indicate a distance of 0.8.

                                                                    What I misunderstood was that not ALL lengths are measured. Only the lengths of the original items (e.g. "a", "b", "c") are counted, BUT are still measured in comparison to the last join of all groups. I was adding all the lengths. i means "item", NOT group.

                                                                    The logic is that the last join is the largest separation in groups. If all items are tightly clustered into separate groups, where the groups are quite distant, then all the distance values will be low in comparison to the final group join.

                                                                    So - how far the groups are away from each other will "help", but only in the last join. Apart from that the AC is more of a measure of how quickly items are grouped, and how close items are to each other in comparison to the proximity of groups.

                                                                    This value looks much better, but still isn't exact. I would put money on the calculations being slightly different, leading to a small difference in value to the R version.

                                                                    Philip
                                                                    ___________________
                                                                    Correct answers don't require correct spelling.
                                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                    PhilipLeitch Copper
                                                                    I think I found the flaw in the AC.

                                                                    Based on the link you sent I see that the "relative" distance is incorrect (scaling by the longest distance). The actual distance is what should be used.

                                                                    Please review the attachment and see if it is more accurate.

                                                                    By the way - I LOVE the line combinations - dendrogram. I've seen this in my books. It is effective, clear and I had no idea how to produce it.

                                                                    Philip
                                                                    ___________________
                                                                    Correct answers don't require correct spelling.
                                                                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                        A.Non PTC Community Champion
                                                                        On 2/21/2009 6:59:13 AM, pleitch wrote:
                                                                        >I think I found the flaw in
                                                                        >the AC.

                                                                        I think it's still wrong. If I feed the data into the web program I get AC=0.8877464975. That's high, which is what I would expect. I'm sure it's doing the same calculation, because this is the dendrogram it produces:



                                                                        The branches are not in the same order, but the groupings and (very importantly) the distances are identical. That's good, because it means that your program and the dendrogram routine are correct.

                                                                        When I read this:

                                                                        "Let d(i) denote the dissimilarity of object i to the first cluster it is merged with, divided by the dissimilarity of the merger in the last step of the algorithm.
                                                                        The agglomerative coefficient (AC) is defined as the average of all [ 1-d(i)]"

                                                                        it says to me that you have to calculate AC as you go through each step of the clustering. That's why I didn't attempt to fix it. It needs to be calculated in your main program, and I didn't want to mess with that. I figured you could probably do it in less time than it would take me to figure out where to begin.
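As a rough illustration of the definition quoted above, here is a minimal Python sketch that records d(i) during the clustering itself. It is not the Mathcad worksheet and not the R or web implementation: it assumes group average linkage with Euclidean distance, uses a deliberately naive search for the closest pair, and the function names and test points are invented.

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def agglomerative_coefficient(points, dist=euclidean):
    """AC for group average (UPGMA) clustering of `points`.

    d(i) is the merge height at which object i first joins a cluster,
    divided by the height of the final merge; AC is the mean of 1 - d(i).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two objects")

    def average_dissimilarity(pair):
        a, b = pair
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]   # each cluster is a list of indices
    first_merge = [None] * n             # height of each object's first merge
    height = 0.0
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average_dissimilarity)
        height = average_dissimilarity((a, b))
        for c in (a, b):
            if len(c) == 1:              # a single object merging for the first time
                first_merge[c[0]] = height
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    # After the loop, `height` is the height of the last merge.
    return sum(1 - h / height for h in first_merge) / n

# Two tight, well separated pairs versus four evenly spaced points.
ac_clustered = agglomerative_coefficient([(0.0,), (1.0,), (10.0,), (11.0,)])
ac_even = agglomerative_coefficient([(0.0,), (1.0,), (2.0,), (3.0,)])
print(ac_clustered, ac_even)
```

Only the first merge of each original object contributes, which is the "i means item, not group" point made earlier in the thread. The two toy data sets give a high AC for the well separated pairs and a noticeably lower one for the evenly spaced points.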

                                                                        >By the way - I LOVE the line
                                                                        >combinations - dendrogram.

                                                                        I wouldn't know what to do with a hierarchical clustering algorithm without it :-)

                                                                        Richard
                                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                    PhilipLeitch Copper
                                                                    Thank you Richard

                                                                    "I am wondering if you have the formula for AC correct."

                                                                    It matched the book that I got it from but that DOES NOT mean that I have it correct. I will double check against the links you sent through. The version I have attempts to determine the % of the distance graph (the bar graph I made) that is white. So if most of the values are low, with one or two very large values at the end (distant examples), as happened in this example, then the amount of white space will be very high.

                                                                    Anyway - my point about PCA - and more specifically factor analysis (next couple of projects I'll be doing), is that you can move the factor's effects in Euclidean space to maximise the clustering..... I think. I'll give it a go anyway.


                                                                    Philip
                                                                    ___________________
                                                                    Correct answers don't require correct spelling.
                                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                      A.Non PTC Community Champion
                                                                      On 2/22/2009 4:50:24 AM, pleitch wrote:

                                                                      >This value looks much better,
                                                                      >but still isn't exact. I
                                                                      >would put money on the
                                                                      >calculations being slightly
                                                                      >different, leading to a small
                                                                      >difference in value to the R
                                                                      >version.

                                                                      You would have lost your money. It's a bug. Your indexing should start at 0, not 1 (Labels has no header row). Then the numbers match to within roundoff. I figured out how to get the distances from the web program too, and those also match to within roundoff (a rather more demanding comparison than just eyeballing dendrograms!). So numerically, we are now golden.

                                                                      I changed the functions slightly so that UPGMA takes the function name of the distance measure as a parameter. That makes it much easier to call it multiple times to compare the measures. I also added an autolabeling function, because I'm not into typing large vectors of arbitrary strings every time the data is changed.
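The worksheet itself is not shown in the thread, so the following is only a Python analogue of the two changes described above: passing the distance measure in as a function, and generating labels automatically. All names here (`dissimilarity_matrix`, `autolabel`, the `obj` prefix) are hypothetical.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def dissimilarity_matrix(data, dist):
    """Pairwise dissimilarities; `dist` is any function of two rows."""
    return [[dist(p, q) for q in data] for p in data]

def autolabel(n, prefix="obj"):
    """Generate labels obj0, obj1, ... so indexing starts at 0."""
    return [f"{prefix}{i}" for i in range(n)]

data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
labels = autolabel(len(data))

# The same routine is called once per distance measure for comparison.
for name, dist in (("euclidean", euclidean), ("manhattan", manhattan)):
    d = dissimilarity_matrix(data, dist)
    print(name, labels[0], labels[1], d[0][1])
```

Taking the measure as a parameter keeps the clustering code unchanged when measures are compared, and generated labels stay in step with the data when rows are added or removed.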

                                                                      Richard
                                                                        • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                          PhilipLeitch Copper
                                                                          Ahhh... fantastic.

                                                                          Thanks for the collaboration with this.

                                                                          Philip
                                                                          ___________________
                                                                          Correct answers don't require correct spelling.
                                                                          • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                            PhilipLeitch Copper

                                                                            >>I would put money on the
                                                                            >>calculations being slightly
                                                                            >>different, leading to a small
                                                                            >>difference in value to the R
                                                                            >>version.
                                                                            >
                                                                            >You would have lost your money. It's a
                                                                            >bug.
                                                                            >Richard

                                                                            Just a quick comment - there's a reason I don't gamble!


                                                                            Philip
                                                                            ___________________
                                                                            Nobody can hear you scream in Euclidean space.
                                                                              • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                PhilipLeitch Copper
                                                                                Attached with Ward's method.

                                                                                Ward's method uses a completely different approach, so I had to re-program some areas.

                                                                                What I am hoping to achieve is a worksheet where I can plug values in at the top and then, using a radio button or combo box, select the appropriate method AND distance calculation. So this is still a work in progress.


                                                                                A warning on Ward:

                                                                                It is extremely sensitive to outliers.

                                                                                Ward's method makes dissimilarity blow up to infinity: "Indeed, in many real data sets we have noticed that the final level of Ward's clustering was much larger than the largest dissimilarity between any two objects." By contrast, "The dissimilarity of a single linkage will tend to 0: The more points you draw from two strictly positive densities, the closer will be their nearest neighbors. For any two such densities, their limiting dissimilarity becomes 0."
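
A small numeric illustration of the first quote, as a hedged Python sketch. It assumes Ward's between-cluster dissimilarity takes the form sqrt(2|A||B| / (|A|+|B|)) times the Euclidean distance between the two centroids, which is my reading of how it is usually stated for AGNES; the worksheet may compute it differently. The helper names are made up for the example.

```python
from math import dist, sqrt


def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def ward_dissimilarity(a, b):
    """Assumed form: sqrt(2|A||B| / (|A|+|B|)) * ||mean(A) - mean(B)||."""
    scale = sqrt(2 * len(a) * len(b) / (len(a) + len(b)))
    return scale * dist(centroid(a), centroid(b))


# Two tight groups on a line, ten units apart.
group_a = [(0.0,), (1.0,), (2.0,)]
group_b = [(10.0,), (11.0,), (12.0,)]

final_level = ward_dissimilarity(group_a, group_b)
largest_pairwise = max(dist(p, q) for p in group_a for q in group_b)
print(round(final_level, 2), largest_pairwise)
```

Here the final join sits at about 17.3 although no two objects are more than 12 apart, and the scale factor keeps growing as the groups get larger.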

                                                                                What does all that mean? Just check out the Histograms:




                                                                                Philip
                                                                                ___________________
                                                                                Nobody can hear you scream in Euclidean space.
                                                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                    A.Non PTC Community Champion
                                                                                    On 2/24/2009 12:55:14 AM, pleitch wrote:
                                                                                    >Attached with Ward's
                                                                                    >method.

                                                                                    >Ward's method uses a
                                                                                    >completely different approach,
                                                                                    >so I had to re-program some
                                                                                    >areas.

                                                                                    What's the point of the "Null" distance? You can't use it with the average linkage (at least, not meaningfully), and Ward's method doesn't use the distances so it doesn't matter what the distance metric is.

                                                                                    >What I am hoping to
                                                                                    >achieve is a worksheet that I
                                                                                    >can plug values in the top,
                                                                                    >then using a radio button or
                                                                                    >combo box, select the
                                                                                    >appropriate method AND
                                                                                    >distance calculation. So this
                                                                                    >is still a work in
                                                                                    >progress.

                                                                                    Yes, that would be my goal too. I can add a bunch of that stuff, but I have a long trip coming up in a couple of days (3-4 weeks in Asia) and I have a lot to do before I go so I'm not sure when I'll get to it.

                                                                                    Richard
                                                                                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                        PhilipLeitch Copper

                                                                                        >
                                                                                        >What's the point of the "Null" distance?
                                                                                        >You can't use it with the average
                                                                                        >linkage (at least, not meaningfully),
                                                                                        >and Ward's method doesn't use the
                                                                                        >distances so it doesn't matter what the
                                                                                        >distance metric is.

                                                                                        Ward doesn't need it - but the existing procedure still called it. So to avoid re-writing the procedure I supplied a distance method that didn't calculate anything.

                                                                                        So this was to ensure all the other functions would work in place (without modification), even though some methods don't use them.
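
In Python terms the trick looks roughly like this. It is only a sketch: `build_distance_matrix` is a hypothetical stand-in for the existing procedure, which always calls whatever measure it is given.

```python
from math import dist


def null_distance(p, q):
    """Placeholder measure: satisfies the interface but computes nothing."""
    return 0.0


def build_distance_matrix(points, distance):
    """Stand-in for the existing procedure; it calls the measure unconditionally."""
    return [[distance(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (3.0, 4.0)]
print(build_distance_matrix(points, dist))           # real distances
print(build_distance_matrix(points, null_distance))  # all zeros; a method like Ward ignores them
```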

                                                                                        (I played with a couple of ideas - I think that's how I left it).

                                                                                        >>What I am hoping to
                                                                                        >>achieve is a worksheet that I
                                                                                        >>can plug values in the top,
                                                                                        >>then using a radio button or
                                                                                        >>combo box, select the
                                                                                        >>appropriate method AND
                                                                                        >>distance calculation. So this
                                                                                        >>is still a work in
                                                                                        >>progress.
                                                                                        >
                                                                                        >Yes, that would be my goal too. I can
                                                                                        >add a bunch of that stuff, but I have a
                                                                                        >long trip coming up in a couple of days
                                                                                        >(3-4 weeks in Asia) and I have a lot to
                                                                                        >do before I go so I'm not sure when I'll
                                                                                        >get to it.

                                                                                        You have already added quite a lot. Distance measures, dendrograms and so on.

                                                                                        FYI - I'll be adding some other methods, including centroid, over the next few days, so I'll also clean up some of the names. ED stands for Euclidean Distance matrix - but that isn't valid any more. I've really enjoyed this little project.


                                                                                        If you are traveling to Australia, let me know - our paths might cross.

                                                                                        Cheers,
                                                                                        Philip
                                                                                        ___________________
                                                                                        Nobody can hear you scream in Euclidean space.
                                                                                • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                  A.Non PTC Community Champion
                                                                                  On 2/22/2009 6:13:35 PM, pleitch wrote:

                                                                                  >Thanks for the collaboration
                                                                                  >with this.

                                                                                  This one definitely works both ways. I also wanted hierarchical cluster analysis in Mathcad, and you have done much more than half the work.

                                                                                  I think we need to add the calculation of AC to the UPGMA program too. It would make it much easier if it's called more than once with different parameters. I may do it tomorrow. I'll have UPGMA return a nested matrix with the current Groups matrix as one entry and AC as the other. That way it won't break the dendrogram routine :-)

                                                                                  Richard

                                                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                      PhilipLeitch Copper
                                                                                      Although it would make sense to include it - I think I'll leave it separate.

                                                                                      AC isn't frequently reported alongside this kind of clustering. Out of three textbooks I have, only the book that ONLY deals with "finding groups in data" has it.

                                                                                      Also - will it make it quicker? I can't see that the AC calculation will actually take very long.

                                                                                      The AC can only be calculated at the end, because it scores each item as (1 - Proportional Distance), where the proportional distance is the distance at which the item is first joined divided by the distance of the final join.

                                                                                      So you would need a check on each iteration to determine whether one of the items being joined is "singular" and, if so, add its join distance to a running total. Then, at the end, divide the total of the "singular" distances by the number of items and by the final distance, and subtract the result from 1...


                                                                                      Assume n is the number of items.

                                                                                      A minimum has been found (i.e. at each join):
                                                                                          For each of the two clusters being joined:
                                                                                              ? Is this a singular item ?
                                                                                              If Yes
                                                                                                  TotalDistance <- (TotalDistance + (CurrentDistance/n))
                                                                                          ? Is this the final join ? //Note - not mutually exclusive with the previous check.
                                                                                          If Yes
                                                                                              AC <- (1-(TotalDistance/CurrentDistance))
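
The running-total idea above can be sketched in Python as follows. The input format is an assumption made for the example: a list of joins in order, each recorded as (size of first cluster, size of second cluster, join distance). When both sides of a join are single items, both are counted.

```python
def agglomerative_coefficient(joins, n):
    """AC from the join history.

    joins: list of (size_a, size_b, level) tuples in the order the joins happened.
    n:     number of items being clustered.
    """
    final_level = joins[-1][2]
    total = 0.0
    for size_a, size_b, level in joins:
        # Each singular item contributes the level of its first join.
        singles = (size_a == 1) + (size_b == 1)
        total += singles * level / n
    return 1.0 - total / final_level


# Four items: two tight pairs (joined at 1.0) that finally join at 10.0.
joins = [(1, 1, 1.0), (1, 1, 1.0), (2, 2, 10.0)]
print(agglomerative_coefficient(joins, 4))
```

For this toy history every item first joins at one tenth of the final level, so the coefficient comes out at about 0.9, i.e. a strong clustering structure.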



                                                                                      On a final note.

                                                                                      I like the "auto-label" and "auto-scale". I also like the "Display Nested Matrices" option - I didn't know about it, but went looking for it after I saw you use it. Also - adding comments with quotation marks is great. I would have added lots of comments if I had known about that.

                                                                                      Thanks again.

                                                                                      I'll probably be adding other distance measures to this shortly.... but first I'll give it a run with my company data.

                                                                                      Philip
                                                                                      ___________________
                                                                                      Nobody can hear you scream in Euclidean space.
                                                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                      PhilipLeitch Copper
                                                                                      RE Tag it on the end... Yeah - that will work... but it will be slower than what I suggested.

                                                                                      Probably won't notice much unless there are lots and lots of objects.

                                                                                      Ward's method - also known as ESS or Error Sum of Squares - is one of the next couple I am doing (literally implementing now).

                                                                                      If you want to wait off - I'll have it done soon.




                                                                                      Philip
                                                                                      ___________________
                                                                                      Nobody can hear you scream in Euclidean space.
                                                                                        • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                          A.Non PTC Community Champion
                                                                                          On 2/23/2009 6:43:24 PM, pleitch wrote:
                                                                                          >RE Tag it on the end... Yeah -
                                                                                          >that will work... but it will
                                                                                          >be slower than what I
                                                                                          >suggested.
                                                                                          >
                                                                                          >Probably won't notice much
                                                                                          >unless there are lots and lots
                                                                                          >of objects.

                                                                                          But no slower than it is now. Thinking about it though, maybe the right thing to do is just encapsulate AC in its own function. Then if you don't need it, the execution time is zero :-)

                                                                                          >Ward's method - also known as
                                                                                          >ESS or Error Sum of Squares -
                                                                                          >is one of the next couple I am
                                                                                          >doing (literally implementing
                                                                                          >now).

                                                                                          I haven't heard it called that, but that's not surprising. If you want to check it's working correctly, that R-based web program also does Ward's method.

                                                                                          >If you want to wait off - I'll
                                                                                          >have it done soon.

                                                                                          I am not one to unnecessarily duplicate work!

                                                                                          Richard
                                                                                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                              PhilipLeitch Copper
                                                                                              Done - check my recent posts.

                                                                                              Just to be clear, it isn't called ESS/Error Sum of Squares; however, it is based on it. It uses the Euclidean distances between the objects of a cluster and its centroid. So instead of using distances to combine close objects, it determines which objects would combine to create the next smallest group (i.e. the next tightest group of objects). That way the focus is on creating small groups and stepping out to larger ones, as opposed to joining the closest objects together.

                                                                                              At least... this is my understanding.
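
A rough Python sketch of that selection step, for anyone following along. It uses the usual textbook formulation, in which the pair chosen is the one whose union increases the total error sum of squares the least; that is an assumption about Ward's method in general, not a transcription of the worksheet, and the function names are invented for the example.

```python
from itertools import combinations


def ess(cluster):
    """Error sum of squares: squared Euclidean distances to the centroid."""
    n = len(cluster)
    centre = [sum(coords) / n for coords in zip(*cluster)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centre)) for p in cluster)


def best_ward_merge(clusters):
    """Index pair whose union increases the total ESS the least."""
    def increase(pair):
        a, b = clusters[pair[0]], clusters[pair[1]]
        return ess(a + b) - ess(a) - ess(b)

    return min(combinations(range(len(clusters)), 2), key=increase)


clusters = [
    [(0.0, 0.0), (0.0, 1.0)],  # an existing tight pair
    [(0.0, 3.0)],              # a nearby single object
    [(10.0, 0.0)],             # a distant single object
]
print(best_ward_merge(clusters))
```

With these three clusters the nearby single object is absorbed into the tight pair, because that union stays far tighter than anything involving the distant object.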

                                                                                              Philip
                                                                                              ___________________
                                                                                              Nobody can hear you scream in Euclidean space.
                                                                                          • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                            A.Non PTC Community Champion
                                                                                            On 2/23/2009 12:51:42 AM, pleitch wrote:
                                                                                            >Although it would make sense
                                                                                            >to include it - I think I'll
                                                                                            >leave it separate.
                                                                                            >
                                                                                            >AC isn't frequently noted when
                                                                                            >calculating this measure. Out
                                                                                            >of three text books I have,
                                                                                            >only the book that ONLY deals
                                                                                            >with "finding groups in data"
                                                                                            >has it.
                                                                                            >
                                                                                            >Also - will it make it
                                                                                            >quicker? I can't see that
                                                                                            >the AC calculation will
                                                                                            >actually take very long.

                                                                                            No, it would just make it easier to call it multiple times. I am a great believer in encapsulating everything in a function, rather than writing out the same code more than once.

                                                                                            >The AC can only be calculated at
                                                                                            >the end - because it measures
                                                                                            >the length of each item as (1
                                                                                            >- Proportional Distance). The
                                                                                            >proportional distance is the
                                                                                            >distance of the item's join to
                                                                                            >the distance of the final
                                                                                            >join.
                                                                                            >
                                                                                            >So you would need to
                                                                                            >have a check on each
                                                                                            >iteration to determine if one
                                                                                            >of the items is "singular", if
                                                                                            >so, add it to a running total.
                                                                                            >Then - at the end of the
                                                                                            >expression, divide all the
                                                                                            >"singular" distances by the
                                                                                            >number of items, then by the
                                                                                            >final distance...
                                                                                            >
                                                                                            >Assume n is the number of Items
                                                                                            >
                                                                                            >A minimum has been found:
                                                                                            >? Is this a singular Item ?
                                                                                            >If Yes
                                                                                            >TotalDistance <- (TotalDistance + (CurrentDistance/n))
                                                                                            >? Is this the final join ? //Note - not mutually exclusive to last if.
                                                                                            >If Yes
                                                                                            >AC <- (1-(TotalDistance/CurrentDistance))

                                                                                            You are making it sound much more difficult than it is. I was just going to tag it on to the end of the routine, right before you return "Results."


                                                                                            >I'll
                                                                                            >probably be adding other
                                                                                            >distance measures to this
                                                                                            >shortly....

                                                                                            In my experience what makes a much bigger difference to the clustering is the linkage (e.g. UPGMA). What you have is the average linkage, but there are lots of others: single, complete, and centroid are common. I have had a lot of success with Ward's method in the past.
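
To make the difference between the linkages concrete, here is a short Python sketch of the four named above, each written as a function of two clusters of points. Euclidean distance is assumed throughout and the function names are illustrative only.

```python
from math import dist


def single(a, b):
    """Single linkage: distance between the closest pair of objects."""
    return min(dist(p, q) for p in a for q in b)


def complete(a, b):
    """Complete linkage: distance between the farthest pair of objects."""
    return max(dist(p, q) for p in a for q in b)


def average(a, b):
    """Average linkage (UPGMA): mean of all between-cluster distances."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(a, b):
    """Centroid linkage: distance between the two cluster means."""
    mean_a = [sum(c) / len(a) for c in zip(*a)]
    mean_b = [sum(c) / len(b) for c in zip(*b)]
    return dist(mean_a, mean_b)


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
for linkage in (single, complete, average, centroid):
    print(linkage.__name__, round(linkage(a, b), 3))
```

Even on this tiny example the four rules disagree about how far apart the two clusters are, which is why the choice of linkage can reshape the whole tree.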

                                                                                            Richard
                                                                                            • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                              PhilipLeitch Copper
                                                                                              Post post post.... sorry for all the posts in quick succession - but I forgot to say that I have bookmarked that R page and I have been using it as a cross-check.

                                                                                              Ward checks out (both values and AC).


                                                                                              Cheers,
                                                                                              Philip
                                                                                              ___________________
                                                                                              Nobody can hear you scream in Euclidean space.
                                                                                              • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                A.Non PTC Community Champion
                                                                                                On 2/24/2009 1:09:40 AM, pleitch wrote:
                                                                                                >Done - check my recent posts.
                                                                                                >
                                                                                                >Just to be clear, it isn't
                                                                                                >called ESS/Error sum of
                                                                                                >squares. However, it is based
                                                                                                >on this. It finds the
                                                                                                >Euclidean distances between
                                                                                                >the objects of the cluster and
                                                                                                >its centroid. So instead of
                                                                                                >determining distances to
                                                                                                >combine close objects, it is
                                                                                                >determining the next objects
                                                                                                >that would combine to create
                                                                                                >the next smallest group (i.e.
                                                                                                >the next tightest group of
                                                                                                >objects). That way the focus
                                                                                                >is on creating small groups -
                                                                                                >stepping out to larger groups
                                                                                                >as opposed to joining the
                                                                                                >closest objects together.
                                                                                                >
                                                                                                >At least... this is my
                                                                                                >understanding.

                                                                                                Sounds about right. Here's another description:

                                                                                                "The previous algorithms merge the two groups which are most similar. Ward's Algorithm, however, tries to find as homogeneous groups as possible. This means that only two groups are merged which show the smallest growth in heterogeneity factor H. Instead of determining the spectral distance, the Ward's Algorithm determines the growth of heterogeneity H."

                                                                                                Richard
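
For readers following along outside Mathcad, the "smallest growth in heterogeneity" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the worksheet's code: heterogeneity is taken to be the error sum of squares (ESS) about the cluster centroid, and the names `ess`, `ward_increase` and `best_ward_merge` are made up for this sketch.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def ess(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_increase(a, b):
    """Growth in ESS caused by merging clusters a and b (lists of tuples)."""
    return ess(a + b) - ess(a) - ess(b)


def ward_increase_closed_form(a, b):
    """Same quantity via n_a*n_b/(n_a+n_b) * squared centroid distance."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


def best_ward_merge(clusters):
    """Index pair (i, j) of the two clusters whose merge adds the least ESS."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
    )


clusters = [[(0.0, 0.0)], [(0.0, 1.0)], [(5.0, 5.0)], [(9.0, 9.0)]]
print(best_ward_merge(clusters))
```

The closed form is included only as a cross-check on the direct ESS difference; either one can be used to rank candidate merges.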
                                                                                                • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                  PhilipLeitch Copper
                                                                                                  >But why not just ignore it? It doesn't
                                                                                                  >matter what the distance method is. If
                                                                                                  >you wanted some minor time saving, why
                                                                                                  >not just set all the elements to 0?

                                                                                                  Ahhh - because I was trying different options and didn't come back and finish it off. It started off being set to zero, then I realised Ward needed the data supplied. So I set the Null to contain the data - but that wasn't very logical, so I made sure I passed the data to both methods. Obviously I never came back and re-set Null to zeros.


                                                                                                  >No. I have to go to Bali first, then
                                                                                                  >Beijing. It's all business, but I'll
                                                                                                  >slot in some personal time in Bali
                                                                                                  >(would be dumb not to!). I've been to
                                                                                                  >Beijing so many times that's just
                                                                                                  >another long business trip :-(

                                                                                                  If you go to Bali then you will see lots of Australians. Have fun.

                                                                                                  Philip
                                                                                                  ___________________
                                                                                                  Nobody can hear you scream in Euclidean space.
                                                                                                  • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                    A.Non PTC Community Champion
                                                                                                    On 2/25/2009 7:35:21 AM, pleitch wrote:

                                                                                                    >Ward doesn't need it - but the
                                                                                                    >existing procedure still
                                                                                                    >called it. So to avoid
                                                                                                    >re-writing the procedure I
                                                                                                    >supplied a distance method
                                                                                                    >that didn't calculate
                                                                                                    >anything.

                                                                                                    But why not just ignore it? It doesn't matter what the distance method is. If you wanted some minor time saving, why not just set all the elements to 0?
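
The "distance method that doesn't calculate anything" can be pictured as below. This is a hypothetical Python sketch, not the Mathcad procedure: it assumes the clustering routine takes the distance method as an argument, and the names `euclidean_distance`, `null_distance` and `build_distances` are invented for illustration.

```python
from math import dist


def euclidean_distance(data):
    """Full pairwise Euclidean distance matrix for a list of points."""
    return [[dist(p, q) for q in data] for p in data]


def null_distance(data):
    """Placeholder method: an all-zero matrix of the right shape."""
    n = len(data)
    return [[0.0] * n for _ in range(n)]


def build_distances(data, distance_method):
    """Stand-in for the step in the procedure that calls the distance method."""
    return distance_method(data)


data = [(0.0, 0.0), (3.0, 4.0)]
print(build_distances(data, euclidean_distance))
print(build_distances(data, null_distance))
```

With this layout a method such as Ward's, which works from the raw data, can be passed `null_distance` so the existing call still succeeds without doing any real work.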

                                                                                                    >If you are traveling to
                                                                                                    >Australia, let me know - our
                                                                                                    >paths might cross.

                                                                                                    No. I have to go to Bali first, then Beijing. It's all business, but I'll slot in some personal time in Bali (would be dumb not to!). I've been to Beijing so many times that's just another long business trip :-(

                                                                                                    Richard
                                                                                                    • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                      PhilipLeitch Copper
                                                                                                      Sure thing.

                                                                                                      From my perspective this has been collaborating at its best. Some things I would not have included, some things would have been wrong. I'm glad I've been able to contribute. Normally I sit back and just learn from the likes of Jean, Tom and yourself.

                                                                                                      I'll include that link with the next round of mods.

                                                                                                      Philip
                                                                                                      ___________________
                                                                                                      Nobody can hear you scream in Euclidean space.
                                                                                                      • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                        A.Non PTC Community Champion
                                                                                                        On 2/26/2009 1:39:32 AM, pleitch wrote:
                                                                                                        >Latest Additions/Revisions:
                                                                                                        >Cleaned up some of the
                                                                                                        >variables and text.
                                                                                                        >
                                                                                                        >Cleaned up Null distance to be
                                                                                                        >zero (should have been this
                                                                                                        >from the start).
                                                                                                        >
                                                                                                        >
                                                                                                        >Added more methods:
                                                                                                        >Single Linkage (Nearest
                                                                                                        >Neighbour)
                                                                                                        >Complete Linkage (Furthest
                                                                                                        >Neighbour)
                                                                                                        >Centroid Method (Average)

                                                                                                        Excellent :-)
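
As a rough cross-check on what each method measures between two clusters, here is a minimal Python sketch. It is not taken from the worksheet. It assumes plain Euclidean distance, and it assumes "Centroid Method" means the distance between the two cluster means; the group average function corresponds to the method in the thread title. All function names are illustrative.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def single_linkage(a, b):
    """Nearest neighbour: smallest distance between any pair (p in a, q in b)."""
    return min(dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Furthest neighbour: largest distance between any pair (p in a, q in b)."""
    return max(dist(p, q) for p in a for q in b)


def group_average(a, b):
    """Group average: mean of all between-cluster pairwise distances."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Centroid method (assumed): distance between the two cluster means."""
    return dist(centroid(a), centroid(b))


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
for method in (single_linkage, complete_linkage, group_average, centroid_linkage):
    print(method.__name__, method(a, b))
```

At each agglomeration step the pair of clusters with the smallest value of the chosen function is the one that gets merged.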


                                                                                                        >Updated Bar graph is set to
                                                                                                        >match the length shown on the
                                                                                                        >web site (which is different
                                                                                                        >to my book):
                                                                                                        >http://www.wessa.net/rwasp_agg
                                                                                                        >lomerativehierarchicalclusteri
                                                                                                        >ng.wasp#output
                                                                                                        >
                                                                                                        >However, the order of the bar
                                                                                                        >graph is still out. The order
                                                                                                        >is based on both distance and
                                                                                                        >join path (New_Labels variable
                                                                                                        >later in the worksheet) - but
                                                                                                        >I don't have the time/care
                                                                                                        >factor to do that right now.

                                                                                                        I also noticed the differences in the graphs. If you compare the one on the web site to the dendrogram on the web site you'll notice they contain exactly the same information though. The bar graph is a sort of "filled in" dendrogram. If I find time while I'm in Asia I might see if I can modify the dendrogram routine to also return the correct data for the banner plot (a.k.a. bar graph :-)). I also intend to convert the AC and dendrogram calculations to functions. Ideally, it should be possible to call all of these routines multiple times without having to copy code. Finally, a lot of stuff needs to be dropped into collapsed areas so it's easier to scroll up and down the worksheet. And of course we need list boxes, etc. to pick methods. I'll see what I can get done, but for sure nothing for the next week or so (until I get to Beijing).
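
To give an idea of what "AC as a function" might look like, here is a small Python sketch. It is not the worksheet's calculation. It assumes the Kaufman and Rousseeuw definition of the agglomerative coefficient (the average over objects of one minus the height at which the object is first merged, divided by the height of the final merge), a hypothetical merge list of `(left_members, right_members, height)` tuples in merge order, and a method whose merge heights do not decrease.

```python
def first_merge_heights(merges, n):
    """Height at which each of the n objects is first joined to something."""
    first = [None] * n
    for left, right, height in merges:
        for obj in (*left, *right):
            if first[obj] is None:
                first[obj] = height
    return first


def agglomerative_coefficient(merges, n):
    """Average of 1 - (first merge height / final merge height) over objects."""
    final = merges[-1][2]
    first = first_merge_heights(merges, n)
    return sum(1.0 - h / final for h in first) / n


# Three objects: 0 and 1 join at height 1.0, then object 2 joins at 4.0.
merges = [((0,), (1,), 1.0), ((0, 1), (2,), 4.0)]
print(agglomerative_coefficient(merges, 3))
```

Because the function only needs the merge list, it can be called once per clustering method without copying any code.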

                                                                                                        Richard
                                                                                                        • Cluster Analysis - Agglomerative Nesting (AGNES)
                                                                                                          ptc-1368288 Copper
                                                                                                          On 2/27/2009 12:12:40 AM, pleitch wrote:
                                                                                                          >>BTW, as you are in
                                                                                                          >>the medical field, do you have
                                                                                                          >>the Delaunay diagram in a
                                                                                                          >>project? I tried a long time
                                                                                                          >>ago but to no avail, because I
                                                                                                          >>was too much of a novice.
                                                                                                          >
                                                                                                          >No. I've never dealt with
                                                                                                          >that.
                                                                                                          >
                                                                                                          >Do you mean creating a
                                                                                                          >triangle mesh, where...
                                                                                                          >constraints...
                                                                                                          >
                                                                                                          >If so - then no... but here is
                                                                                                          >a link that might be helpful:
                                                                                                          >http://www.cs.cmu.edu/~quake/t
                                                                                                          >riangle.html
                                                                                                          >
                                                                                                          >If not - what is it you are
                                                                                                          >after?
                                                                                                          >
                                                                                                          >Ta.
                                                                                                          >Philip
                                                                                                          >____________________________

                                                                                                          Interesting applets in there.

                                                                                                          http://www.diku.dk/hjemmesider/studerende/duff/Fortune/

                                                                                                          It sounds like a big project!

                                                                                                          jmG