
Cluster Analysis - Agglomerative Nesting (AGNES)

PhilipLeitch
1-Newbie


AGGLOMERATIVE NESTING (AGNES) USING THE GROUP AVERAGE METHOD OF SOKAL AND MICHENER (1958)

As adapted from "L. Kaufman and P.J. Rousseeuw (1990), FINDING GROUPS IN DATA: AN INTRODUCTION TO CLUSTER ANALYSIS, New York: John Wiley." Chapter 5.

For those who believe all statistics are lies - rest assured there is no significance testing here - the decision for what is a separate group and what is not is purely up to the researcher's own discretion.

Philip
___________________
Correct answers don't require correct spelling.
PhilipOakley
5-Regular Member
(To:PhilipLeitch)

In V11.2a there is a bug on "Results" in the big bit of code. It is in the Test for Results==0 as Results is at that point undefined.

It is big code 😉

Philip Oakley

Thanks. I don't have 11 so I can't test these things. I have 13 and 14 (from what I can see they aren't very different from each other except for the symbolic engine).

About all I can do is save back to version 11 and cross my fingers.

Philip
___________________
Correct answers don't require correct spelling.

On 2/18/2009 6:30:13 AM, philipoakley wrote:
>In V11.2a there is a bug on
>"Results" in the big bit of
>code. It is in the Test for
>Results==0 as Results is at
>that point undefined.
>
>It is big code 😉
>
>Philip Oakley
____________________________
Thanks, Philip & Philip

No need to look at it.

jmG






BTW, there was a project similar to what
you might have done, can't retrieve.

jmG

Thanks Jean.

I realise there are projects like I have done, but I am coding myself for two reasons:

1) I learn by doing - not by reading (I hear I forget, I see I learn, I do I understand). I "saw" the concept in the book and learnt it, but understanding the implementation meant coding it myself.

2) Most of the worksheets I produce are for research or development. By research I mean mining company data (lots of it). By development I mean determining a process/procedure in Mathcad then replicating in source code (normally C# or T-SQL).
Therefore, I don't just need to replicate a process that works, I need to actually do a bit of optimisation.

For instance, in the method it is the determining of the distance between groups that I'm still trying to make even faster. Cartesian joins are not good, so they should be avoided if possible. My rough idea is that if group A is joining to group B, and both are farther from group C than from any other group, there should be no need to compute the distance from group AB to C. In fact, unless group A or group B is "relatively close" to group C compared with the other groups, there is no need to compute the distance. However, one iteration later there may be a need for the calculation of AB to C. I'm assuming that some type of centroid could be recorded, and only groups within a relatively close centroid range should have their distances calculated - if they aren't calculated already. Also - another idea is that if the distance from A to C is known, as is B to C, and the number of members of A, B and C are known, surely there is a simpler way of determining the distance from AB to C.
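[Editor's note: the last idea in the paragraph above is in fact a known property of the group-average method: if the sizes of A and B and their distances to C are already known, the merged distance follows directly, with no new pairwise computation. A minimal sketch (function name is mine):]

```python
# Group-average (UPGMA) update: the distance from the merged group
# A+B to any other group C is a size-weighted mean of the two known
# distances, so no Cartesian join over members is required:
#   d(AB, C) = (|A| * d(A,C) + |B| * d(B,C)) / (|A| + |B|)

def merged_group_distance(d_ac, d_bc, n_a, n_b):
    """Average-link distance from the merged cluster A+B to group C."""
    return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
```

For example, merging a singleton A (distance 2 to C) with a singleton B (distance 4 to C) gives a merged distance of 3, the plain average; with unequal sizes the larger group dominates, as it should.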

.. so I'm still getting my head around this.

Philip
___________________
Correct answers don't require correct spelling.

On 2/18/2009 6:17:20 PM, pleitch wrote:
>Thanks Jean.
>
>I realise there are projects
>like I have done

This might be what Jean was referring to:

http://collab.mathsoft.com/read?27146,17e#27146

As far as my final sentence of the thread goes, it's still on my to-do list 🙂

Now I can also add to my to-do list a comparison of your algorithm and k-means. At the current rate I should have that done within the next decade or so 🙂

Richard

Thanks - interesting.

The thread was a little off in what it said. There are two types of cluster analysis, but they are Partitioning and Hierarchical. Within each there are multiple approaches. The approach I posted is Hierarchical; the k-means is one of many types of Partitioning.

In fact, k-means is very common - maybe the most popular of the partitioning methods - however the authors of the book I'm going off suggest that the k-medoid method is superior to the k-means approach.

Why? The k-means results depend on the order of objects in the input (as I believe you noted in your worksheet). Also, k-means, along with other variance minimisation methods, is strongly affected by outliers.

The authors state:
"the k-medoid method is appealing because it is more robust than the error sum of squares employed in most methods. (The maximal dissimilarity used in the k-center method is even less robust.) Furthermore it allows good characterization of all clusters that are not too elongated and makes it possible to isolate outliers in most situations."
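[Editor's note: the robustness difference the authors describe can be seen in a toy one-dimensional sketch; the data here are invented for illustration, not taken from any worksheet in the thread:]

```python
# A single outlier drags the mean far from the bulk of the data,
# while the medoid (the actual data point minimising the summed
# absolute dissimilarity to all other points) stays put.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

mean = sum(data) / len(data)  # pulled toward the outlier: 22.0

def total_dissimilarity(candidate, points):
    """Sum of absolute dissimilarities from candidate to all points."""
    return sum(abs(p - candidate) for p in points)

# The medoid must be one of the data points themselves.
medoid = min(data, key=lambda c: total_dissimilarity(c, data))  # 3.0
```

The mean lands at 22.0, nowhere near any actual observation, while the medoid remains at 3.0 in the middle of the bulk of the data.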

I will plug your data into my worksheet and post the results.

Philip
___________________
Correct answers don't require correct spelling.

It looks like there should be three groups.

When you look at the original plot you "think" you can see the three groups - but these aren't the three groups that play out on analysis. The reason is that each data dimension is treated as being in the same units of measurement, therefore the distance from point m to point o is actually fairly big in Euclidean space only because of one dimension (the first one). The other dimensions are proportionally smaller and therefore have proportionally less contribution to the distance. So to properly see the data/groups it is best to observe with the same scale on every axis (I show both auto-scaled and a manual 0-300 scale).
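[Editor's note: the scale effect described above can be made concrete with made-up numbers; the points m and o here are hypothetical, not the actual data:]

```python
import math

# When one dimension spans a much larger range than the others, it
# dominates the Euclidean distance; the rest barely contribute.
m = [250.0, 3.0, 5.0]
o = [40.0, 4.0, 6.0]

squared_terms = [(a - b) ** 2 for a, b in zip(m, o)]  # [44100, 1, 1]
dist = math.sqrt(sum(squared_terms))
share_of_first_axis = squared_terms[0] / sum(squared_terms)
```

Here the first axis accounts for over 99.99% of the squared distance, so any clustering based on this distance effectively sees only one dimension.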

Philip
___________________
Correct answers don't require correct spelling.

What's interesting is that I get issues when trying to find the "most optimal" answer from your k-means version.

Two different messages from two different MCad Versions.

Philip
___________________
Correct answers don't require correct spelling.

Here is the work sheet "Mahalanobis"

Courtesy Jim S. (Offroc) Aug 19, 2007

jmG

On 2/18/2009 9:09:44 PM, pleitch wrote:

>Two different messages from
>two different MCad Versions.

A piece of dumb programming on my part. I'm surprised it actually worked in version 11. This works in all versions (but I have not fixed the issue with Dunn's validity index).

Richard

In the program COMPARISON,

"Results" is red as not included in the LHS function, presumably one of the dist(,,,) above. In Mathcad 11 it would have to be assigned a matrix variable name... next is rows(combined_ED) as index: that does not work in 11 either. Too much work to convert to 11. What's wrong is the 13 style, which could probably be done 11 style. I understand that if you never enjoyed 11 or an earlier version you can't see how much more logical the programming style was.

Interesting but does not go far enough.

jmG

I disagree completely. It works on 13 - therefore no problem.


>What's wrong is the 13
>style that could probably be
>done 11 style.
Or from my perspective, what's wrong is that you aren't running this in the most current versions of the product. I'm not surprised that it doesn't work when used on a product that's ??5?? years out of date. Since it works on mine and everyone else's purchased within the last 4-5 years, then there is nothing actually "wrong".

>I understand
>that if you never had enjoyed
>11 or earlier version you
>can't figure how more logical
>the programming style was.
True, and also amusing. Using the same logic: I don't know the benefits that punchcard programming offers.

Old technology that had its merits, and frequently had superior aspects, but technology has moved on. No matter how good 11 was, it is outdated and superseded many times over.

However - I get what you mean, I LOVED version 2.5 (or there about), and if I could find my old disks or track down another copy I would love to install it.



>Interesting but does not go
>far enough.
It has gone around the world - probably tens of thousands of km... how far does it need to go???

Seriously though, I'm not sure what you mean by "does not go far enough". It does exactly what I want it to - no more and no less.


Could it have been programmed more elegantly? Yes.

Could it have been programmed in a way that it worked on outdated versions?
Yes.

Could I have "taken it further"?
Yes

Do these arguments hold any weight for me whatsoever?
No. Not that I can see anyway.... I made it the way I wanted it and to take it any further would have wasted my time.



The Mahalanobis post was great - especially the PCA - I was half way through coding my own PCA and had got stuck. The mathematics notation in the book I have is beyond me - seeing it in Mathcad is great. Thank you very much for that post.

Where did the sheet come from?

Ta.

Philip
___________________
Correct answers don't require correct spelling.

>No matter how good 11 was, it is outdated and superseded many times over <.
______________________________

How many times have you read from the "Mathcad Power Users" that what worked in 11 no longer works in 13 or 14? You must have read it hundreds of times, and how many times has it been reported to me directly that 13 fails reading 11.

No more argument, if you know how to make it work in 11 and save 2001i, even 2000, you will have extra possible contribution. Only reading will save me precious time.

jmG

>How many times you have read
>from the "Mathcad Power Users"
>that what worked in 11 does
>not any more in 13, 14 ?

500 - no 502.... or was it 501. No - it's equal to how many roads a man must go down before he's considered a man divided by the length of a piece of string.

he he he Seriously though - you do have a valid point about compatibility.

I wish I had ver 11 to use some of the great worksheets - especially in the area of distributions. However, lack of true backwards compatibility isn't a reason to stop all progress.

I have some great 16 bit apps that don't work on my 32 bit system, and some 32 bit apps that don't work on my 64 bit system. Even emulation (DOS Box etc.) doesn't help in all cases.

This isn't stopping me upgrading. Lack of backwards compatibility IS an issue, but that shouldn't stop you from getting upgrades.

Yes - I stayed with XP because I found out many of my peripherals don't work on Vista (also found it too slow). But I'm still planning on upgrading to the next windows version because that's progress.

>No more argument, if you know
>how to make it work in 11 and
>save 2001i, even 2000, you
>will have extra possible
>contribution. Only reading
>will save me precious time.

Agreed.

I don't know how to make it work on 11. If there are simple rules I could follow to make it work on 11... I probably still wouldn't bother because I don't have 11 (or earlier). I write sheets for myself. I post them here so that perhaps others find them useful. BUT to save people's time I will note which version(s) a sheet has been developed and tested in.

I still wish I could get earlier versions of Mathcad....

Cheers,
Philip
___________________
Correct answers don't require correct spelling.

On 2/19/2009 12:30:44 AM, pleitch wrote:

>I don't know how to make it
>work on 11.

I fixed it so it works in version 11 and version 14. You had some statements in programs with an undefined variable on the rhs. Surprisingly, they worked in versions 13 and 14 because the variable was also on the lhs. Also, rows of an empty string throws an error in version 11, so I changed it to rows(0).

I also added a few more distance metrics.

AND....

I figured that since you had done the work of writing the hierarchical clustering algorithm, which is something I have wanted in Mathcad but been too lazy/busy to write, I would write the other essential part: a dendrogram drawing algorithm. It makes it a heck of a lot easier to see the grouping!

Lastly, as I suspected, your calculation for AC is wrong. This is how it's supposed to be calculated:

http://www.unesco.org/webworld/idams/advguide/Chapt7_1_4.htm

You can check it here:

http://www.wessa.net/rwasp_agglomerativehierarchicalclustering.wasp

I ran the snake data through the web program and your sheet and the dendrogram looks fine: all the distances match. The web value of AC is about 0.9 though, which is a lot more reasonable.

Richard
RichardJ
19-Tanzanite
(To:RichardJ)

A minor bug fix.

Richard

On 2/18/2009 8:57:09 PM, pleitch wrote:
>It looks like there should be
>three groups.

I would argue there are four groups. You need to be very careful with data scaling.


The Story of the Court Mathematician

Once upon a time, in a land far, far away, there lived a King. The King had a very pretty daughter, but he also had a problem. His daughter was very fond of the numerous snakes that could be found in the palace grounds (she was pretty, but a little strange). Some of these snakes were poisonous and some were not, but nobody knew how to tell them apart. This worried the King greatly, because he did not want his daughter to be bitten by a poisonous snake before he could demand a huge dowry for her hand in marriage from the King of the much larger kingdom to the west. He knew he could not just get rid of all the snakes, because this would upset his beloved daughter, so he called in his most learned court scholars. When presented with the King's dilemma, the Court Mathematician promptly announced "there is a new method called 'cluster analysis' that I think may elucidate the problem". The King, not being nearly as learned as the wise mathematician, replied "I didn't know you could elucidate a snake, or why that would protect my daughter, but if you think it would help then your suggestion has my full support". The mathematician was a little perplexed by the King's answer, but was wise enough to know you did not question a king. The next day he set about making some measurements of the dead snakes he found in the palace grounds (being very wise, he realized that dead snakes, poisonous or otherwise, couldn't bite him). He measured the length and diameter of each snake he found, as well as the length of the fangs. When he had collected enough data, he plotted the three measurements on a graph. He made sure to use the same scale for each axis, because he didn't want to favor one measurement over another. This is what he saw:



There were clearly two species of snake! The only remaining problem was to determine which species was poisonous. Being very wise, he realized that although he could only do this using a live snake, along with a disposable prisoner from the King's dungeon, he did not need to take unnecessary risks by catching more than one. It did not take long for the Court Mathematician to catch one of the larger snakes and determine that, unfortunately for the prisoner, it was poisonous. The next day the Court Mathematician took his findings to the King, who was immensely pleased. The King immediately ordered that all the larger snakes be captured and released over the border of the much smaller kingdom to the east (he did not like the King of the kingdom to the east, because many years before he had demanded a huge dowry for the hand in marriage of his very pretty daughter).

Time passed happily, until one day the King's daughter was bitten by a snake and died. The King was furious, and summoned the Court Mathematician. "You told me that only the large snakes were poisonous, and now my daughter is dead. As a punishment that you will never forget you will be elucidated! Take him away!"

Eventually the ex-Court Mathematician recovered enough from his punishment to investigate where he had gone wrong. After much study he solved the problem by inventing two new techniques for data analysis, which he presented in a very high-pitched voice at the next inter-kingdom symposium on applied mathematics. He called these new techniques 'mean centering' and 'variance scaling'. When he applied these new techniques to his snake data, this is what he saw:



There were three species of snake! One of the two smaller species was also poisonous! The other mathematicians were so impressed they gave him a major award with a nice engraved plaque he could hang on his wall. Alas, he could never have a son to inherit the plaque and be proud of his father's achievements.

There are two morals to this story:
1) If you want to continue to speak in a normal voice, and perhaps have children, do not anger kings
2) If you do not want to anger kings, scale your data correctly prior to analysis.
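[Editor's note: the two techniques from the story amount to subtracting each variable's mean and dividing by its standard deviation, so every variable contributes on an equal footing. A minimal sketch, one variable at a time (function name is mine):]

```python
def autoscale(column):
    """Mean-center and variance-scale one variable (z-scores)."""
    n = len(column)
    mean = sum(column) / n
    # Sample standard deviation (divide by n - 1).
    std = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / std for x in column]
```

After autoscaling, every variable has mean 0 and unit sample variance, so a low-variance but informative measurement (like the snakes' color) is no longer swamped by a high-variance one (like their length).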

Richard

Ha. I loved it. VERY interesting. "The Story of the Herpetologist Princess"

If you look at my worksheet you will actually find that the AC was very low, indicating that there wasn't much in the way of clustering at all. It could (should??) be argued that a value that low means no groups exist at all under that approach (i.e. based on the measurements, they are all just "snakes" of different sizes).

My identification of 3 was based on the distance between groupings staying relatively low until the last 2 combinations - indicating that the last two may not be appropriate. However, I would argue that even these 3 groups are so loosely clustered as not to be groups at all. In fact, I would argue that with the values you have there is insufficient clustering to show distinct groups. Three is a small group indeed!

Even with scaling the data looks linear rather than clustered, lending it to Linear Discriminant Analysis (LDA) as the more appropriate tool.

Also - just because I haven't scaled the data doesn't mean that the data can't be scaled appropriately before inclusion in my worksheet. I can scale the data as appropriate. A PCA or a factor analysis could be used to find a scaling of the dimensions, or a factor analysis could move the data in Euclidean space to a point that maximises the effect of the variables equally, which would then be the appropriate transformation for the data (I haven't written the factor analysis stuff yet).

This may be why/how the PCA has been found appropriate to maximise the efficiency of the k-mean approach?? I haven't read into it yet.


However, it all comes back to "what's the question" - and "what's the data". I didn't know either, so had to assume that they were similar measurements in the same units.

For all I knew, they were measurements of height, breadth and depth.

I would still argue that on visual inspection the groups identified by either process are invalid and other processes should be used.


So - when all is said and done, what should the mathematician have done?

He should have gone to a biologist (like me) and asked what should be done. The biologist would have told the princess "DO NOT EAT ANY OF THE SNAKES".

A biologist would have known that only "venomous" snakes can kill you by a bite, "poisonous" snakes have to be eaten to kill you. Almost all snake venoms can be eaten safely (alkaloid toxins are neutralised in the stomach), venoms are only toxic when taken subcutaneously (under the skin). Poisons are always toxic when consumed. A substance can be both venomous and poisonous at the same time BUT the terms are NOT interchangeable, so such a substance would be termed venomous if introduced by injection.

So if the princess had just played with the snake instead of eating it - there would not have been a problem. The bite was just a distraction, or maybe what caused her to bite the snake back resulting in her death of poisoning???

Assuming that this is back in the days of alchemy when they didn't make the distinction.... or that the King and Mathematician were not learned in the ways of nature... we can assume that they meant venomous.

They should have still asked a biologist.

Then the biologist would at least have known WHAT to observe - and would have noted that venomous snakes have fangs at the front of the mouth, unlike the non-venomous back-of-the-mouth snakes (distinctly different morphologies). The fangs themselves are different in shape for the introduction of the venom through the fangs.

All snakes could have been collected - their mouth/fangs inspected and venomous ones given to the kindly king from the smaller kingdom.


Still - without over-thinking the issue - a nice little problem.

Philip
___________________
Correct answers don't require correct spelling.

On 2/19/2009 7:00:03 PM, pleitch wrote:

>If you look at my work sheet
>you will actually find that
>the AC was very low,
>indicating that there wasn't
>much in the way of clustering
>at all

See my other post for what I think about the AC metric.



>A PCA or a Factor analysis,
>could be used to find a
>scaling of the dimensions,

I don't see how you can find the scaling using PCA. In fact, PCA is very dependent upon the scaling of the data.

> or
>a Factor Analysis could move
>the data in Euclidean space to
>a point that maximises the
>effect of variables equally,
>which would then be the
>appropriate transformations
>for the data to scale (I
>haven't written the factor
>analysis stuff yet).

I have 🙂 PCA, anyway. There are more types of factor analysis than I care to think about, let alone write in Mathcad. I wrote the PCA stuff for myself though, so I would have to do some work on it before I was prepared to post it. PCA is implemented in the Data Analysis Extension Pack though, so if you get that you can save yourself some effort.

>This may be why/how the PCA
>has been found appropriate to
>maximise the efficiency of the
>k-mean approach?? I haven't
>read into it yet.

No. PCA just lets you reduce the number of variables, assuming the variables in the raw data are collinear. That can make it much easier for the k-means clustering to find the clusters.


>So - when all is said and done
>what should the mathematician
>done?
>
>He should have gone to a
>biologist (like me) and asked
>what should be done.

He couldn't. The Court Biologist was the Princess. I just forgot to mention that point 😉

>Assuming that this is back in
>the days of alchemy when they
>didn't make the
>distinction....

They didn't. Except perhaps the princess, but of course she wasn't asked.

>They should have still asked a
>biologist.

I think the more important point is that if the King wanted a practical solution to a real world problem, he shouldn't have asked a mathematician 🙂

Richard

On 2/18/2009 7:52:41 PM, pleitch wrote:

>The thread was a little off
>with what it said. There are
>two types of cluster analysis,
>but they are Partitioning and
>Hierarchical.

Thanks for the clarification.

>Within each
>there are multiple approaches.
>The approach I posted is
>Hierarchical

That helps. I haven't had time yet to really look at your worksheet in detail (too busy writing fairy stories!), but it is much easier to figure out what's going on when you know that. I would suggest even adding a comment to that effect at the top of the sheet somewhere.

Richard

From the University of Wikipedia:

Relation between PCA and K-means clustering
It has been shown recently (2007) that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lie, and thus facilitate K-means clustering to find near-optimal solutions.



Philip
___________________
Correct answers don't require correct spelling.

I've looked at the data from the k-means approach and I still don't see 4 groups.

However - it does raise a very good point. Are these the principal components associated with the data?

Using the snake example, you wouldn't take random measurements of snakes and then assume that they will be related to how venomous they are.

Instead, you would attempt to determine which variables are principal in determining the venom attribute (or degree of venomousness). This may well be done after measurements are taken.

If there is no easy measure of venom - other than killing something/someone - then the principal components would logically be the ones that are most useful in distinguishing types of snakes.

I have now attempted to make several views of the data, including a mean divided by stdev. Even then I don't see four groups.

With this data I would happily agree that when viewed from 2 of the three data axes, there does indeed look like 4 groups. But under the third these groups dissipate. The final group (the most extreme one) is so unclustered that if it is to be considered a group, so must the first group (the small one that arguably could be considered two groups).

However - even with the data transformation, the AC is so low as to suggest that there is no clustering/grouping occurring at all.


Philip
___________________
Correct answers don't require correct spelling.

On 2/20/2009 8:00:47 AM, pleitch wrote:
>I've looked at the data from
>the k-mean approach and I
>still don't see 4 groups.

See below


>However - it does draw a very
>good point. Are these the
>priciple components associated
>with the data?

I have no idea, but I doubt it. It's just the example data that was with the Cluster 3.0 software, and nothing whatsoever is known about it. For all I know it's just made up.

>Using the snake example, you
>wouldn't take random
>measurements of snakes and
>then assume that they will be
>related to how venomous they
>are.

Well, perhaps not in the snakes example, but there are plenty of examples where that is what you do. You measure what you can, and then try to correlate that with the known property.

>Instead, you would attempt to
>determine which variables are
>principle in determining the
>venom attribute (or degree of
>venomousness). This may well
>be done after measurements are
>taken.

Exactly. So you measure everything you can, then correlate afterward. We in the spectroscopy world do it all the time.

>If there is no easy measure of
>venom - other than killing
>something/someone, then the
>principle components would
>logicallyb be the ones that
>are most useful in
>distinguishing types of
>snakes.

No, not necessarily. Principal components are based solely on variance in the data. That variance may not correlate with the property you wish to measure though. As an example, take 2 species of snake. They both have about the same length, and that length is highly variable. That's one variable you measure. Now let's say you measure 10 other variables with high variance but little or no discriminatory power (diameter, etc). Finally, you measure color. The snakes have very similar colors, but the within-snake color variation is very small, so you can tell them apart using this. You now have 11 variables, 10 of them with high variance that tell you nothing about the snake species, and one with very low variance that does. The PCs will be dominated by the high variance variables, not the color, and will not help solve the problem. In fact it will make it much worse, because it will take the one useful variable you did have and mix it up in linear combinations with all the others.

>I have now attempted to make
>several views of the data,
>including a mean divided by
>stdev. Even then I don't see
>four groups.
>
>With this data I would happily
>agree that when viewed from 2
>of the three data axes, there
>does indeed look like 4
>groups. But under the third
>these groups disipate.

You can't look at only 2 variables at a time. It's a 3 dimensional data set.

> The
>final group (the most extreeme
>one) is so unclustered that if
>it is to be considered a
>group, so must the first group
>(the small one that arguably
>could be considered two
>groups).

That's the one 🙂

However, I agree it's completely subjective, and since we know nothing about the data it could be any number of groups. Maybe it's just one badly sampled continuous distribution!


>However - even with the data
>transormation, the AC is so
>low as to assume that there is
>no clustering/grouping
>occuring at all.

I am wondering if you have the formula for AC correct. If you have, then I would consider it a rather useless metric, because even for the snake data it's only 0.19. The grouping in the snake data is obvious (for 2 groups, anyway).

Richard

You are right,

The King's daughter preferred baby snakes, shorter on the meter stick. Do you mind if I pass that lovely story to my best friend in statistics as well ?

Jean
RichardJ
19-Tanzanite
(To:ptc-1368288)

On 2/20/2009 10:40:12 AM, jmG wrote:
> Do you mind if I
>pass that lovely story to my
>best friend in statistics as
>well ?

You can pass it on to whoever you wish. Once I post something here I figure it's in the public domain anyway.

Richard


PhilipOakley
5-Regular Member
(To:RichardJ)

On 2/20/2009 10:01:05 AM, rijackson wrote:
>
>No, not necessarily. Principal
>components are based solely on variance
>in the data. That variance may not
>correlate with the property you wish to
>measure though. As an example, take 2
>species of snake. They both have about
>the same length, and that length is
>highly variable. That's one variable you
>measure. Now let's say you measure 10
>other variables with high variance but
>little or no discriminatory power
>(diameter, etc). Finally, you measure
>color. The snakes have very similar
>colors, but the within-snake color
>variation is very small, so you can tell
>them apart using this. You now have 11
>variables, 10 of them with high variance
>that tell you nothing about the snake
>species, and one with very low variance
>that does. The PCs will be dominated by
>the high variance variables, not the
>color, and will not help solve the
>problem. In fact it will make it much
>worse, because it will take the one
>useful variable you did have and mix it
>up in linear combinations with all the
>others.
>
>Richard

The techniques around PCA, such as SVD, etc., will identify the separate groups. In a multidimensional case one has to be cautious about misunderstandings (e.g. the separation axis may not be one of the dimensions). [This is noted for the general reader, rather than Richard who I believe already appreciates this].

We need to be careful in the explanations about where the particular distinctions are in each approach and how, often, they are different ends of the same calculation. The PCA and SVD methods normally sort the components by various measures of size. Sometimes we want to start at the big end and sometimes the small end, depending on what we want to achieve.

It is an optimisation problem. We are trying to optimise the separation between putative groups based on various flexible criteria...

Philip Oakley

Thank you Richard

"I am wondering if you have the formula for AC correct."

It matched the book that I got it from but that DOES NOT mean that I have it correct. I will double-check the links you sent through. The version I have attempts to determine the % of the distance graph (the bar graph I made) that is white. So if most of the values are low, with one or two very large values at the end (distant examples, as happened in this example), then the amount of white space will be very high.
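[Editor's note: for comparison, the definition in Kaufman & Rousseeuw, as this editor understands it and worth checking against the UNESCO page Richard linked, is: for each object take the dissimilarity at the first merger it participates in, divide by the dissimilarity of the final merger, and average one minus these ratios. A sketch:]

```python
def agglomerative_coefficient(first_merge_heights, final_height):
    """AC per Kaufman & Rousseeuw: the mean of 1 - d(i)/d_final, where
    d(i) is the merge height at which object i first joins a cluster
    and d_final is the height of the last merger."""
    return sum(1.0 - d / final_height for d in first_merge_heights) / len(first_merge_heights)
```

Under this definition, tight clusters merged early relative to a distant final merger give an AC near 1 (e.g. all first merges at height 1 with a final merge at 10 gives 0.9), matching the web value of about 0.9 for the snake data.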

Anyway - my point about PCA - and more specifically factor analysis (next couple of projects I'll be doing), is that you can move the factor's effects in euclidean space to maximise the clustering..... I think. I'll give it a go anyway.


Philip
___________________
Correct answers don't require correct spelling.

On 2/20/2009 9:49:32 PM, pleitch wrote:
>Thank you Richard
>
>"I am wondering if you have
>the formula for AC correct.
>
>It matched the book that I got
>it from but that DOES NOT mean
>that I have it correct.

"R" is the programming language designed and used by statisticians. I would be truly amazed if they had it wrong.

>Anyway - my point about PCA -
>and more specifically factor
>analysis (next couple of
>projects I'll be doing), is
>that you can move the factor's
>effects in euclidean space to
>maximise the clustering..... I
>think.

With or without a priori information about the data? With a priori information about the data that's certainly true. Without it, I'm not so sure. You can extend PCA though to give much better classification. In my field a very successful algorithm is SIMCA:

http://www.camo.com/resources/simca.html

Beware though! I believe it is patented.

Richard