Central Limit Theorem

2 collaborators

Uri Wilensky (Author)

Dor Abrahamson (Author)

WHAT IS IT?

This demonstrates relations between population distributions and their sample mean distributions as well as the affect of sample size on this relation. In this model, a population is distributed by some variable, for instance by their total assets in thousands of dollars. The population is distributed randomly -- not necessarily 'normally' -- but sample means from this population nevertheless accumulate in a distribution that approaches a normal curve. The program allows for repeated sampling of individual specimens in the population.

This model is a part of the ProbLab curriculum. The ProbLab curriculum is currently under development at the CCL. For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.

HOW IT WORKS

Either the program or the user creates a population that ranges along some dimension, such as total assets. In the View, we see this population arranged in a "picture bar chart." For instance, poorer people are farther to the left, and richer people are farther to the right. Next, a group of individual specimens from this population is selected as a sample (these sampled people are painted in a different color). The program calculates the mean value of this sample -- their average assets -- and plots this mean in the histogram below the view. We can set the program to sample repeatedly, and we can observe the emergence of the distribution of sample means. We can also look at the corresponding histogram of the sums of the samples. This allows us to study the relation between the sums and the means in terms of properties of their distributions.

HOW TO USE IT

Press SETUP, and then press either CREATE RANDOM PEOPLE or CREATE MY OWN PEOPLE or just press one of the PRESET buttons. If you've created random people or if you have pressed one of the presets, you are now ready to press either GO ONCE or GO. But if you've pressed CREATE MY OWN PEOPLE, you now need to click on the View to create these people. Only then you should press either GO ONCE or GO. You can also vary the setting of the sliders. Watch results in the plot and explore relations between the settings and the results. If you'd like to guess the mean value of the sample means before you begin sampling, you can use the RANGE slider to indicate the location of this mean.

Buttons: SETUP -- initialize variables CREATE RANDOM PEOPLE -- activates SETUP and then creates a random population in the View CREATE MY OWN PEOPLE -- allows the user to create persons by clicking on the View PRESET 1 - PRESET 6 -- creates populations with special patterns of possible interest GO ONCE -- marks a sample in the population and calculates and plot their mean value. GO -- repeats GO ONCE indefinitely.

Sliders: SAMPLER SIZE -- determines the number of specimens sampled at each run through the Go procedure RANGE -- determines the range of values the members of the population can take.

Switches: ALSO-SUMS? -- if set to "On," the sums of the samples (and the means of these sums) is plotted as well as the means of the samples (and the mean of these means). If set to "Off," only the means are plotted. SHOW-EV? -- if set to "On," the EXPECTED VALUE (EV) monitor will display a value

Monitors: NUM-SAMPLES -- shows the total number of samples taken this the last 'setup.' EXPECTED VALUE -- calculates the mean x-value of the population. For each column, the program calculates the product of the number of turtles and the x-value of that column. Next, all these products are summed up and divided by the total number of turtles. STD-DEV-MEANS -- shows the standard deviation of the histogram of sample means. This is an index of how "diffuse" the distribution is. The smaller the number, the tighter (narrower, more clustered) is the distribution of sample means. STD-DEV-MEANS -- shows the standard deviation of the histogram of sample sums.

Plots: SAMPLE-MEAN DISTRIBUTION -- distribution of the mean values of all samples taken and the mean sum of all samples taken.

THINGS TO NOTICE

The property we are looking at is indexed by the "x-value" of the people in the bar chart that is in the View. A person's x-value can be seen in the numerical label at the bottom of its column in the view. For instance, the x-value of people in the left-most column is 0. There could, in principle, be no person with the x-value "0," there could be a single person with that value, or there could be two or more. They all share the same value, because they are all in the same column.

Members of the population that turn green when you press GO or GO ONCE are the 'sample.' Their mean x-value, for instance, their savings, is plotted in the histogram below the view. For example, if a sample of three people is taken and their x-values are 7, 8, and 12, then the histogram column "9" will bump up by one unit, because 9 is the mean of 7, 8, and 12.

For some settings of the population, the more samples you take the more likely you are to get a rare sample. So the distribution you get after only a few samples is not necessarily reflective of all possible mean values.

THINGS TO TRY

Can you see any connections between the distribution of the population (in the graphics display window) and the mean value of the histogram (in the plot window)? For instance, if there happen to be more population specimens ("people") on the left side of the range, where do you expect to see most of the sample means?

Try running the model with a SAMPLE-SIZE of just 1. What do you get. Now try with a SAMPLE-SIZE of 2. Has anything changed? How about a larger sample size?

Are there any connections between SAMPLE SIZE, RANGE, and STD-DEV? One way to explore this question is to keep two of these variables constant and examine what happens when you change the third variable. You may want to take an equal number of samples for each of these trials.

If you set the model to a sample size that is larger than the total size of the population, you will receive a message telling you cannot do this. However, you may set a sample size that is larger than most columns. This means that the entire sample cannot fit into those columns. Is this a problem? What does this do to the distribution of sample means?

Using the CREATE MY OWN PEOPLE option, build some "unusual" populations. Some of these have already been put into the PRESET buttons. For instance, you could create people only in one or two columns, or you could make the population "U-shaped" (more on the outside and less and less as you go towards the middle). What are your findings?

Again, using the CREATE MY OWN PEOPLE option, build one very tall column off on the right side of the view (about at x-value 8) and build a few very short columns. Set the SAMPLE-SIZE to 10. Press GO ONCE. What can you say about the number of persons that happened to be chosen from the tall column? Try this again and then press GO. Look at the plot. Do you see any connection between the chance of getting samples from the tall column and the location of the mean in the plot?

Set ALSO-SUMS? to "On," and activate the program. Can you explain the similarities and differences between the two histograms you get? For instance, you can look at their range, the total area they cover, their height, and their shape. Try to explain the transformation between the two histograms. For instance, why is the histogram on the left taller than the histogram on the right? Look at the standard deviation of the means and of the sums. What is the ratio between these two values? Does this ratio relate to any other value in the settings of this model?

Relating to the two histograms, can you find a case in which the two histograms converge to a single histogram?

Press SETUP, press CREATE MY OWN PEOPLE and make 6 columns of the height 2 (two persons), set the sample size to 2, and set the ALSO-SUMS? to "On." Now press GO. It is interesting to compare between this statistics activity and a probability activity in which you are rolling a pair of dice. For instance, how many possible columns are there in the sums histogram? In fact, such a comparison can help us think through similarities and differences between statistics and probability.

What is the relation between the number of samples you take, the size of each sample, and the resulting distribution of sample means? For instance, if you have a budge to sample 1,000 people, should you take 10 samples of size 100 each or 100 samples of size 10 each? What do you gain and what do you, perhaps, lose, in each of these choices? For instance, in terms of confidence or in terms of information about the population you are sampling from. To explore this question, you may want to extend the range of the sample size. You may also want to resize the view so as to allow for more specimens in your population. Finally, it may be helpful to have a slider and corresponding code for controlling the total number of samples you are taking.

PEDAGOGICAL NOTE

The first thing to remember is that in reality we do not know the distribution of the population from which we are sampling. We only have the plot, so to speak. So as you are interacting with this model, you should recall that in applied statistics the Graphic Display does not exist. In this model, however, we are simulating the population -- as if we do know its distribution -- in order to understand the relation between population metrics and their sample means distributions.

You may have noticed that, almost regardless of the shape of your population, the histogram always eventually takes on a certain shape. This shape is called a "normal curve" or "bell curve" or "bell-shaped curve." We say that the histogram "approaches" the normal curve as one takes more and more samples. For special population distributions, we may get special cases of this curve. For instance, if you have created a population that has all the people in the same column, your histogram will be an extreme case of a bell curve -- it itself will consist only of a single column.

Often, people say that, "a population is distributed...etc," but it could be that, sometimes, what they actually mean to say is that, "the sample-means of the population are distributed...etc." This does not imply that the second figure of speech is necessarily preferable, but only that we should understand the difference between these two ideas. In this model, the view shows how the population itself is distributed, whereas the plot shows how the sample means are distributed. Working with this model, one may be struck by the contrast between these two distributions.

The plot shows both the distribution of the sample means and the mean of these means. This mean of means converges on the expected value of sampling from this population. It can be calculated as the average x-value. That is, multiply the x and y values of each column, add these products, and divide the sum by the total number of data points ("persons"). For instance, if there are 3 persons over the 0 value, 2 persons over the 1 value, and 5 persons over the 2 value, the sum of the three x and y products is: 30 + 21 + 5*2 = 12. We now divide 12 by 10 (the total number of persons). 12 : 10 = 1.2. So if we sample from this population, the mean of the sample means will converge on 1.2. Try this. You can use the above example or any other example you invent.

The biggest challenge is to use this model so as to come up with an explanation of why, we almost always get a bell curve when we take enough samples.

Please note the following point of potential confusion. In order to enable close examination of the sampling process, the populations in this model contains fewer specimens than most populations that are commonly studied by researchers using statistical analysis. For instance, there might be no more than 5 individual specimens in a population who share the same x-value (that is, who are all in the same column). Therefore, a sample that is larger in size than the number of specimens in that column, say a sample of size 8, can never include specimens exclusively from that column. In "real life," it could theoretically happen that an entire random sample is taken from a single column. This means that if you use large sample sizes you should expect to get narrower sample-mean distributions than what one would otherwise expect. For instance, if the left-most column is not tall enough to contain the entire sample, you will never receive a sample mean that is equal in value to the x-value of that column. This is because some of the sample will "spill over" to the right of that column, resulting in a greater sample mean. Because the same logic holds for samples taken from the far right, the sample mean values will be closer to the center than they would be for small samples sizes.

EXTENDING THE MODEL

The current version of the model allows for repeated sampling of individual specimens. That means that if a person was selected randomly in the first sample, it can be sampled again in the second sample, etc. If we did not allow this repeated sampling, would the sample-mean distribution be affected at all? If so, how? Add code that allows the repeated sampling only as an option and compare the outcomes between the two options.

How do medians behave? The same way as means? Add an option to see the median of both the population and the sample data.

Add a monitor that shows the ratio between the two standard deviations represented in this model.

This model shows means and sums of sampled data. It may be interesting to look at other analyses of the samples. For instance, the product of all values of sampled people as well as the n-th root of this product (for a sample of size n).

How does the standard deviation change as we collect more and more samples? To examine this, you can add a plot of the standard deviation over "time" (over samples).

NETLOGO FEATURES

In the procedure draw we used a temporary variable, temp-mouse-xcor. This variable assures that the program won't become "confused." Without this variable, the program might enter a clause in the ifelse code that no longer satisfies what the user actually meant when s/he clicked with the mouse. This could occur, because the user is moving the mouse rapidly. So the program selects the ifelse clause that is correct at that moment, but meanwhile the mouse click would already be some place else, and so the clause selection would no longer be suitable. The temporary variable avoids this by going through with the instructions as though the mouse were still clicked down where it had been a moment ago.

In the current version, you can create the population, but you cannot experiment with the sampling. Build a procedure that allows you to take your own samples.

RELATED MODELS

Several ProbLab models look at the emergence of bell-shaped curves through the accumulation of sample means. See, for example, Prob Graphs Basic and Random Basic Advanced. To look closer at the idea of expected value, see the models Expected Value and Expected Value Advanced.

CREDITS AND REFERENCES

This model is a part of the ProbLab curriculum. The ProbLab Curriculum is currently under development at Northwestern's Center for Connected Learning and Computer-Based Modeling. . For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.

HOW TO CITE

If you mention this model in a publication, we ask that you include these citations for the model itself and for the NetLogo software:

Abrahamson, D. and Wilensky, U. (2005). NetLogo Central Limit Theorem model. http://ccl.northwestern.edu/netlogo/models/CentralLimitTheorem. Center for Connected Learning and Computer-Based Modeling, Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL.
Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL.

COPYRIGHT AND LICENSE

CC BY-NC-SA 3.0

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Commercial licenses are also available. To inquire about commercial licenses, please contact Uri Wilensky at uri@northwestern.edu.

Comments and Questions

Click to Run Model

globals [
  sample-mean-list  ;; list of means of sample taken from the population
  sample-sum-list   ;; list of sums of sample totals
  regular-color     ;; color of specimens in the population
  chosen-color      ;; color of sampled specimens
  ranger            ;; holds value of range so that the range slider can be used for guessing
]

to setup
  clear-all
  set regular-color red + 3
  set chosen-color green
  set sample-mean-list []
  set sample-sum-list []
  ask patches [ set pcolor white - 2 ]
  create-x-line-labels
  reset-ticks
end 

;; the View shows a "picture bar chart." Bottom patches display the 'x value" of this chart

to create-x-line-labels
  ask patches with [pycor = min-pycor]
    [
      set plabel-color black
      ;; to avoid congestion of labels, we ask only every other patch to display a label
      if pxcor mod 2 = 1 [ set plabel ( pxcor + max-pxcor ) ]
    ]
end 

to create-population ;; CREATE-RANDOM-PEOPLE
  setup
  set ranger range + 1

  ;; colors patches in the range white, others gray
  ask patches
  [
    ifelse pxcor <= ranger + min-pxcor - 1
      [ set pcolor white ]
      [ set pcolor white - 2 ]
  ]

  ;; creates for each column of patches, beginning from the left and moving to the end of the range,
  ;; a random number of specimens ("people"). These are stacked up.
  let counter min-pxcor
  let column-help 0

  ;; ranger is 1 more than range. We add 1 to the range, because the first "x value" is 0
  repeat ranger
  [
    set column-help random-pycor
    ask patches with [ (pxcor = counter) and (pycor > min-pycor)]
      [
        if pycor < column-help
          [ sprout-person ]
      ]
    set counter counter + 1
  ]
end 

to sprout-person
   sprout 1 [
    set shape "face neutral"
    set color regular-color ]
end 

;; procedure allowing users to select columns where new specimens are created

to draw-your-own-people  ;; CREATE-MY-OWN-PEOPLE button
  ask patches [ set pcolor white]
  create-x-line-labels
  set ranger range + 1

  ;; we use a temp-mouse-xcor to avoid confusion when the user moves the mouse rapidly
  let temp-mouse-xcor "N/A"
  ;; each column has a "top-patch." It will be the lowest patch that does not have a person turtle in it
  let top-patch "N/A"

  ;; if there still is room for a new person in the column, a new person will appear just above the highest person there
  if mouse-down? [
      if not ( round mouse-ycor = min-pxcor ) [
      set temp-mouse-xcor mouse-xcor
      ;; locates the top-most patch, in the column where you click, that has a person in it, and assigns the patch above it
      ;; If there are no persons in the column, the top-patch is the bottom patch in the column
      ifelse any? patches with [ ( any? turtles-here ) and ( pxcor = round temp-mouse-xcor) ]
        [
          set top-patch patch (round temp-mouse-xcor)
            ;; we do not want to assign top-patch a pycor of a patch outside the world
             min list ( max-pycor )
                     (1 + max [ pycor ] of patches with [ ( any? turtles-here ) and ( pxcor = round temp-mouse-xcor) ] )
        ]
        [
          set top-patch patch (round temp-mouse-xcor) (1 + min-pycor)
        ]  ;; there is a possibility that the very top patch is already occupied, so in that case we do not create a new turtle
        ask top-patch [ if not any? turtles-here [ sprout-person ] ]
      ]
    ]

  display
  wait .2
end 

to go
  reset-turtles

  ;; we check to make sure that there are enough turtles to sample
  ifelse sample-size <= count turtles
  [
    ask n-of sample-size turtles
    [
      set shape "face happy"
      set color chosen-color
    ]
  ]
  [
    user-message word "There are not enough people to take a sample of this size."
                      "\n\nEither create more people or choose a smaller sample size"
    stop
  ]
  tick
  calculate-and-plot-sample-stuff
end 

to reset-turtles
  ask turtles
  [
    set shape "face neutral"
    set color regular-color
  ]
end 

to calculate-and-plot-sample-stuff
  ;; gets the mean and the sum of the sample, then plots these
  ;; we add max-pxcor to compensate for the negative values of xcor
  let temp-mean mean [ xcor + max-pxcor ] of turtles with [ color = chosen-color ]
  set sample-mean-list ( lput temp-mean sample-mean-list )
  let temp-sum sum [ xcor + max-pxcor ] of turtles with [ color = chosen-color ]
  set sample-sum-list ( lput temp-sum sample-sum-list )

  ifelse also-sums?
  [
    ;; adjusts the range of the plot to include all the values from the sums list
    ;; There is a possibility that the maximum sum is less than the range, so we include it, too
    set-plot-x-range 0 (1 + max sentence sample-sum-list (ranger - 1) )
    set-current-plot-pen "sums"
    histogram sample-sum-list
    ;; plots the histogram of the sums of sample means
    set-current-plot-pen "sums-mean"
    plot-pen-reset
    plot-pen-up
    plotxy (mean sample-sum-list) 0
    plot-pen-down
    plotxy (mean sample-sum-list) plot-y-max
  ]
  [
    ;; clears the sums histogram and mean, then adjusts the range of the plot
    set-current-plot-pen "sums"
    plot-pen-reset
    set-current-plot-pen "sums-mean"
    plot-pen-reset
    set-plot-x-range 0 ranger
  ]

  ;; plots the histogram of the means of sample means as well as their mean
  set-current-plot-pen "means"
  histogram sample-mean-list
  set-current-plot-pen "means-mean"
  plot-pen-reset
  plot-pen-up
  plotxy (mean sample-mean-list) 0
  plot-pen-down
  plotxy (mean sample-mean-list) plot-y-max
end 

;; the presets are suggested population distributions

to preset-setup
  setup
  set range 30
  set ranger range + 1
  ask patches [ set pcolor white]
end 

to preset1
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and (pycor < 0) ] [ sprout-person ]
end 

to preset2
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and
                     (pycor < min-pxcor + abs pxcor ) ] [ sprout-person ]
end 

to preset3
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and (pxcor mod 2 = 1) ] [ sprout-person ]
end 

to preset4
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and (abs pxcor = max-pxcor) ] [ sprout-person ]
end 

to preset5
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and (pycor < 2 + (- abs pxcor)) ] [ sprout-person ]
end 

to preset6
  preset-setup
  ask patches with [ (pycor > 0 + min-pycor) and (pycor <  1 + pxcor) ] [ sprout-person ]
end 

to-report std-dev-sums
  ifelse also-sums?
  [ report standard-deviation sample-sum-list ]
  [ report "N/A" ]
end 

to-report expected-value
  ;; from each column of patches, we get the number of turtles multiplied by the patch's "x-value"
  ;; Next, we calculate the mean of this list, to get the expected value of the population
  let columns-list [ ( count turtles with [ xcor = [pxcor] of myself ] ) * ( pxcor + max-pxcor ) ] of patches with [pycor = max-pycor]

  if show-EV? [ report (sum columns-list) / (count turtles ) ]
end 


; Copyright 2005 Uri Wilensky.
; See Info tab for full copyright and license.

There are 15 versions of this model.

Uploaded by	When	Description	Download
Uri Wilensky	over 12 years ago	Updated to NetLogo 5.0.4	Download this version
Uri Wilensky	about 13 years ago	Updated version tag	Download this version
Uri Wilensky	about 13 years ago	Updated to version from NetLogo 5.0.3 distribution	Download this version
Uri Wilensky	almost 14 years ago	Updated to NetLogo 5.0	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Updated from NetLogo 4.1	Download this version
Uri Wilensky	over 15 years ago	Model from NetLogo distribution	Download this version
Uri Wilensky	over 15 years ago	Model from NetLogo distribution	Download this version
Uri Wilensky	over 15 years ago	Central Limit Theorem	Download this version

Attached files

File	Type	Description	Last updated
Central Limit Theorem.png	preview	Preview for 'Central Limit Theorem'	over 12 years ago, by Uri Wilensky	Download

This model does not have any ancestors.

This model does not have any descendants.

NetLogo