PROFESSOR: Morning, everybody.
Thank you for being here.
Happy Friday.
Today is a lecture on histograms.
Before that, the usual announcements,
of which there really aren't any other than--
well, I'm looking at that page, and it's not yet updated.
But homework two was due yesterday,
so that's not really an announcement.
Homework three will be released today,
and then it's the usual Thursday deadline,
but you get a point for turning it
in on Wednesday-- nothing new there.
Next Friday-- next Friday is special.
Next Friday, we will release project one--
or at least we'll discuss the release of project one.
And it's even more special because one
of my fellow instructors is going
to be here to give that lecture. John DeNero
will be lecturing on Friday.
And he is a legend, you should absolutely hear him.
I've had more years teaching than him,
but every time I watch him I learn something new.
So please show up.
All right, that was announcements, short and sweet.
Here's the plan for today--
You now have many table methods. You
know how to use data in tables in quite varied ways.
Today, we are going to talk about distributions--
I will define what that means in just a moment--
and we'll talk about visualizing them,
both numerical and categorical.
And so before we get started on those, let's
just remind ourselves of some terminology.
Types of data-- we discussed this briefly last time--
data can be of many types.
Two common types are numerical and categorical.
So numerical is just data that are numbers--
heights, weights, temperatures, number of students in a class,
all of these.
And of course, being numbers, they
are subject to the rules that govern numbers.
They have relations to each other--
15.2 is bigger than 13, it's not less, and so on.
Categorical is another common type where the data aren't
numbers, they are categories.
So it could be your favorite color, ethnicity,
which year you are in school--
so freshman, sophomore, junior, senior.
And let's see-- not satisfied, somewhat satisfied, highly
satisfied, OK, those are all categorical.
Now that last one has a natural ordering,
as does freshman, sophomore, junior, senior.
But vanilla, chocolate, strawberry doesn't necessarily
have an ordering, unless you are very particularly
interested in one of them.
So these are two common types of data.
There are many other types of data-- data can be maps,
they can be music, they can be anything.
But these are two common types.
Yes?
STUDENT: Are data like zip codes numerical or--
PROFESSOR: Are data like zip codes numerical or categorical?
That is an excellent question.
Let's try to answer it.
So if I look at zip code 94720, that's a number,
it's an integer.
Yes?
But then let's look at another zip code--
94704, that's also a number, yes?
Does it make sense to take the difference of the two?
Well, you can subtract, you'll get another number,
but that number doesn't have an interpretation.
So in some sense, they are just labels.
And it's a short way of saying, it's this region in Berkeley.
So that is a good point that some data that
look numerical are actually categorical,
and we saw that last time, when, you know,
sex was coded as zero, one, two, or something
like that.
OK, so some terminology that we've been using throughout.
Individual is a unit whose features are being recorded.
A variable, we've used several words--
variable, feature, attribute.
And across different individuals,
a variable has different values.
So if the variable is favorite color, then
across different people, the value of the color
will be different.
Variables can be numerical or categorical, with subtypes
within these, as we have discussed.
And that should say, and many other types besides.
But we are going to focus, especially today,
on numerical and categorical variables.
The key is, when you look at an individual,
the categories are defined so that the individual fits
in one and only one category.
No blurring, no overlapping; that would
make counting and organization complicated.
OK, so what's a distribution?
What I'd like you to imagine is the individuals
are all the students in this class.
The variable is their year in school--
so freshman, sophomore, junior, senior,
why don't we call graduate students just graduate.
So the variable is the year in school,
the values of the variable are freshman, sophomore,
junior, senior, graduate.
A distribution takes each value, like freshman,
and counts how many students correspond to that value.
So how many freshmen, how many sophomores,
how many juniors, how many seniors,
how many graduate students?
It says how the class is distributed over the values
of the variable.
So in a distribution, each person
should appear once and only once in the categories,
and everybody should appear.
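A minimal sketch of that idea in plain Python, assuming a small made-up roster (the list `years` below is hypothetical, not the actual class):

```python
from collections import Counter

# Hypothetical class roster: one year-in-school value per student.
years = ["freshman", "sophomore", "freshman", "junior", "senior",
         "graduate", "sophomore", "freshman"]

# The distribution: each distinct value with the number of students who have it.
distribution = Counter(years)

for value, count in distribution.items():
    print(value, count)

# Every student lands in exactly one category, so the counts add up to the class size.
total = sum(distribution.values())
print(total == len(years))
```

The final check is the "once and only once, and everybody appears" condition stated as arithmetic.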
And so what we're going to do is we're
going to start with a categorical distribution--
a distribution of a categorical variable.
And we are going to visualize it.
And you have seen these visualizations
since you were little, we're just giving names to things
that you know.
And we're going to start out with a bar chart.
And I don't know what it is, when
we developed this material, we must have
been in some real movie kick.
OK, so these are the 200 top grossing movies
up to the year 2017.
And let me be clear about what top grossing means--
it's how much ticket revenue there was.
Now, you can see that Gone with the Wind is a pretty old movie,
it was released in 1939.
Ticket prices were nice and low at that stage--
so comparing the total dollar amount
that was taken in for tickets in 1939
to the total dollar amount that is taken in by a movie
now, when we're all paying upwards of $10
to go see a movie, is not fair.
So that column, gross, is the number of dollars of tickets
that were sold in that year.
The column gross adjusted is that same number of tickets,
but sold at 2017 prices.
OK, so basically it's a comparison
of how many tickets were sold, and what the ticket price was.
And you can see that on that scale,
Gone with the Wind just blows everybody
else out of the water.
So the rows have been sorted in decreasing order
of the adjusted gross amount.
I'll pause for a second, I'll let you take a look.
Questions about what the rows represent?
All right, so the individuals are movies, and there is studio--
which is a categorical variable--
there are gross and gross adjusted,
which are numerical variables.
The year, you can think of as a label,
but also, it's reasonable to do some arithmetic on them.
So you can take a difference of years,
and you get the amount of time in between.
So we will, today, think of it as a numerical variable.
What I want to do is I want to look
at this column called studio.
So for each movie, it's the studio that released the movie.
And you can see that Fox appears twice,
Paramount appears twice, Disney, surely,
if you look at other movies it's going to appear more than once.
And so that variable, studio, has a distribution.
There are certain distinct studios,
and there's a count of how many movies came from each studio,
and we're going to try to first get hold of that distribution.
So what we'll do is just work with the studios.
Each movie corresponds to one studio,
that's the same as saying each row corresponds to one studio.
And I just want to see, for each studio,
how many times did it appear in this data set?
And there's a method that does that.
It's called Group.
And here's what Group does--
this is just an introduction to Group,
next week, we'll do Group in all its detail.
It takes all the rows, groups them by the distinct value--
so it'll put everything that says Fox in one lot,
and then count how many there are.
So Group did the counting for you.
It said, the studios are ABC, old Buena Vista--
that's a Disney thing--
Columbia, Disney, Dreamworks, you
can see that it's put them in alphabetical order.
One of the movies was released by ABC, 35
by Buena Vista, nine by Columbia,
and so on-- it's done the counting for you.
Let me give this thing a name.
All right, so that's fine.
So now we can see at one go what were all the distinct studios,
and how many movies came from each studio.
But that is a little hard to see,
so it's a great idea to draw a picture.
And what we're going to do is we're going to take this,
and we're going to draw a bar chart.
Why do I keep typing a dot?
I should know better.
OK bar H, because it is going to be a horizontal bar
chart, for reasons I will explain in just a moment.
Here's what you have to give bar H. You have to say,
OK, these guys are all the distinct values
of the categorical variable.
This is the variable I want summarized, so give it
the label of the column.
And you say, use these as the categories,
and these to help you draw the length of the bars.
So you just give it the name of your categorical variable.
And it will use the other as the length of the bar.
And if I do this, I get that.
So this says however many studios it has,
and a bar for each studio, and it's
a way of getting a visualization of that distribution
that we had.
And a couple of things to note here--
a horizontal bar graph because, if you had it vertical,
labeling is difficult. The names of the studios
run into each other.
The names of the studios are long,
and so by the time you have Paramount and Paramount
Dreamworks, putting them next to each other,
they'll just overlap.
So horizontal bar chart.
OK, all right, important thing to note--
on the horizontal axis, you have a numerical variable.
It's the count-- it's how many movies were released.
On the vertical axis, you don't have numbers,
you have categories.
That means one thing for a start.
There is no particular order to these.
They are in alphabetical order just because that's what Group did.
But there is no reason for them to be in alphabetical order.
In fact, putting them in alphabetical order
makes the graph difficult to understand.
What's the natural thing to do here?
Sort-- sort by what?
Sort by descending, or in descending order
of number of movies.
So if we go back and take our table,
and sort by the counts column--
that is a much easier to interpret graph.
So, I mean, you can see that Buena Vista had
the most, and then Warner Brothers,
and so on, and so forth.
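As a rough illustration of sorting categories by count before drawing, here is a plain-Python sketch. It uses only the three studio counts read out above; the helper `text_barh` is a hypothetical stand-in that prints text bars and is not the library's charting method.

```python
# Studio counts as read out in the lecture (only three studios are quoted).
counts = {"ABC": 1, "Buena Vista": 35, "Columbia": 9}

# Categories have no fixed order, so we are free to sort by count, largest first.
ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

def text_barh(rows):
    """Return one text line per category; the bar length is the count."""
    label_width = max(len(label) for label, _ in rows)
    return [f"{label:<{label_width}} | {'#' * n} {n}" for label, n in rows]

for line in text_barh(ordered):
    print(line)
```

Every bar here is one text row high, which mirrors the point that bar width in a bar chart is a design choice and carries no numerical meaning.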
Crucial thing to note is we were able to permute--
that is rearrange the bars--
because they are categorical.
There is no natural ordering.
With numbers, you can't do this.
You can't suddenly put five below three.
The other thing to note is these bars have widths.
And I don't know how many of you have thought
about the width of bars in bar charts,
but I think you understand now once you
are thinking about them that it's
entirely up to the designer.
There's no numerical reason for the bar
to be of a certain width--
the width is usually decided based
on appearance, how much fits on a screen,
and so on and so forth.
So it's basically a designer's choice, what the width is.
Also notice there are gaps between the bars.
How wide those gaps are is also designer's choice.
And it makes sense that all the bars have the same width,
and all the gaps are the same, because otherwise, you
give more emphasis to one rather than the other.
And that's so silly.
And why am I noticing these rather obvious things?
Because when you get to numbers, you
don't have any of these choices.
And you have to make your choices that
are consistent with the order of the numbers.
All right, questions thus far?
OK.
When we get to numerical variables,
there is one thing that happens immediately.
So now, imagine that your variable is height.
So one person is 68.3 inches tall,
somebody else is 62.5 inches tall,
somebody else is 72.1 inches tall, and so on.
So there is a value for every individual,
just as there was a value for every individual here--
every movie came from a studio.
Now, it's very easy to assign the movie to the right group--
you just use the name of the studio,
you stick it into that group.
With numerical variables, you might not
want the level of detail that says,
how many people were 68.3 inches tall?
How many people were 68.4 inches tall?
How many people were 68.5 inches tall?
You might want to just say, OK, look,
I'm going to look at how many people are between 68
and 70 inches tall, and then 70 and 72, and 60 and 65,
whatever.
Right?
So that choice is up to you.
Those are called bins--
you put the values in bins.
So unlike categorical variables, for numerical variables,
you have to start with what is called binning.
That is a summary of what we just
did for the categorical variable--
you've got a distribution, which you get by using Group.
And you display a bar chart, one bar for each category.
And the length of the bars is the number
of individuals in the category.
Numerical variable, you start with binning.
And so we have to look at binning
in a little bit of detail, so let's do that now.
If you bin numerical values, what you're doing
is each bin is a range from here to there,
and you are counting the number of individuals in the range.
And always, there is an issue about the endpoints--
what happens to the people at the end points?
And so there are conventions that are used.
And the convention that we are going to use
is consistent with Python's convention,
that every range includes the left end, but not the right.
So if that's a data set, and I have decided to use these bins,
how do you start binning those values?
Well, 188 goes there.
And I want you to look at that notation-- that's
the 185 to 190 bin.
Do you see the square bracket at the 185?
And the parenthesis at the 190?
Yes?
That is math code for the 185 is actually
included in that interval, but 190 is not.
So often you will see closed at the left end,
open at the right.
All right, so the next one is 170.
Could you please talk to your neighbor
and figure out where is it going to go?
Is it going to go to the left of 170,
or is it going to go to the right of 170?
Quickly.
So here, or here?
All right, votes for here?
Votes for here?
All right, and there are some people who
just will never vote at all.
Yeah, OK, so you're doing well.
170 is there, because the 170 to 175 interval includes the 170.
The 165 to 170 interval does not include the 170,
by our convention.
You can have the opposite convention if you wish,
but the methods that you are going to be using here
in the Data Science Library use this convention,
it's consistent with Python.
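The left-closed, right-open convention can be sketched in plain Python as follows. The list of edges is an assumption modeled on the five-unit bins mentioned above, and `find_bin` is a hypothetical helper; it leaves out any special handling of the very last endpoint.

```python
from bisect import bisect_right

# Hypothetical bin edges, five units wide, like the ones on the slide.
edges = [160, 165, 170, 175, 180, 185, 190]

def find_bin(value, edges):
    """Return the (left, right) bin containing value, using [left, right)."""
    if not edges[0] <= value < edges[-1]:
        raise ValueError("value is outside the bins")
    # bisect_right counts the edges that are <= value, so a value sitting
    # exactly on an edge is assigned to the bin that starts at that edge.
    i = bisect_right(edges, value) - 1
    return edges[i], edges[i + 1]

print(find_bin(188, edges))  # (185, 190)
print(find_bin(170, edges))  # (170, 175)
```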
OK, so here we go.
Basically stacking the bricks, one above the other.
And can you see something that's awfully like a bar chart?
That is emerging very much like a bar chart,
and so we have to be able to just do this in general,
without us having to go through one at a time, counting values.
Agreed?
All right, so we are going to now-- the first thing we're
going to do is we are going to get
Python to do the binning for us, so we
don't have to go through this.
And then we'll see how to do the visualization.
So what I like to do is to construct
a numerical variable-- that is the ages of the movies.
Let me remind you what the data set looks like.
So you've got the year in which they were released,
so we'll just take 2018, subtract off the year,
and we'll get the age.
And what I'd like to do is I'd like
to look at the distribution of the ages of these movies.
So I get to choose bins.
And so, in order to choose the bin,
it's a great idea to know which was the oldest one,
and which was the newest one.
Or not which one, but how old was
the youngest and the oldest.
So if I run that--
the youngest one was one year old.
What was the big, huge movie that was released last year,
do you think?
Yeah, Star Wars, it's always Star Wars.
All right, and then one of them is 97 years old.
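The age calculation is a single subtraction per movie. In this plain-Python sketch, 1939 is the Gone with the Wind release year from earlier; the other two years are assumptions picked to match the youngest and oldest ages just quoted.

```python
# Release years: 1939 is from the lecture; 1921 and 2017 are assumed so that
# the ages reproduce the quoted minimum (1) and maximum (97).
years = [1921, 1939, 2017]

# Age of each movie as of 2018.
ages = [2018 - year for year in years]

youngest, oldest = min(ages), max(ages)
print(youngest, oldest)  # 1 97
```

Knowing the minimum and maximum first makes it easy to pick bins that cover all of the data.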
So I chose some bins.
OK, so bins are specified as an array.
An array of what?
An array of the left end points.
So zero to five, five to 10, 10 to 15, 15 to 25, et cetera.
100 is a curiosity--
I'm ending at 100.
What happens beyond 100, we'll discuss in just a moment.
But when you see bins as an array, it should be increasing.
And they represent the left endpoints
of all your intervals.
All right, so let's bin our movies--
top dot, and the method is called Bin.
You have to say you are binning according to what?
We are binning according to the age column.
And what are the bins?
They are in the array that I have called My Bins.
So two arguments-- the label of the column,
which is your variable, and the array that are your bins.
What have I done?
Did I not run my bins?
I probably didn't run my bins.
Yep-- OK.
That's the difficulty of having typed your code in ahead
of time.
All right, so what happened here?
We need to learn to read this table.
That first line is not a line by itself--
it actually includes the five of the second line.
That first line says, for the bin whose left end is zero,
there are 21 movies in that bin.
All right, if the left end is zero, where will the right end be?
Five, right?
So the ages that are counted in the first bin
are zero, one, two, three, four--
not five.
Five is counted in the next one.
OK so far?
And so if you go down, that's the whole lot.
And there is this curiosity at 100--
I just want to make sure that--
there were 200 movies in all.
What should the answer be?
What am I adding up?
Which column?
Age count, yes?
If I add them up, what should I get?
Every movie appears once and only once in there.
I should get the number of movies.
There were 200 movies in all, let's see.
Good.
So what you're seeing is that this is a distribution.
Every movie appears once, and exactly once in there.
OK, this 100 you should think of as the right end
of the last bar, 65 to 100.
That zero says, I'm not counting anything beyond that.
In fact, there is nothing beyond that, we know.
The largest one is 97.
So typically, your last endpoint will be chosen in that way,
slightly beyond your largest value.
And somebody is going to ask, well, what
if there were something at 100?
We'll see in just a moment.
So I chose my bins-- notice, it's up to me
to choose my bins.
I chose my bins so that they're not equally wide--
you see that?
Zero to five is five years wide, five to 10 is five years wide,
10 to 15 is five years wide, but then, 15 to 25
is 10 years wide.
And 25 to 40 is 15 years wide, and so on and so forth.
And with numerical data, bins don't have to be equal.
And this is the crucial thing that
distinguishes the display of a numerical distribution
from a bar chart.
Bins do not have to be equal.
And you have to take that into account when we do the display,
as you will see.
But you could have the bins to be equal.
You can specify the bins to be sequential-- a range of values
separated by a step.
How about we do zero--
I just have bins of width 25.
So I did a recount--
zero to 25, not including 25, 25 to 50,
not including 50, et cetera, et cetera.
And you have another count, you can totally
do that, no problem.
Some of you will be wondering at this point,
why are we doing this fussing at the edge?
Why don't we just say zero to 24?
25 onwards?
It is because numbers don't have to be whole numbers.
So you can have arbitrarily many decimal places,
and then you will get very, very close to edges with your data.
And so this is a way of just taking
care of all kinds of numerical variables at once.
All right, what I want to do is something
that there is really no natural reason for doing,
but if you did it--
so notice that I've always gone beyond the range of the data.
What if, instead, for some reason known only
to the gods of weirdness, I stopped at 60.
Then that range would be--
I would have zero, I would have 25,
I would have 50, and no more.
Agreed?
So it's actually, because of the step,
it's the same as stopping at 50.
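The stepping behavior can be checked with Python's built-in `range`, which excludes its stop value. Whatever array-building function the notebook actually uses is not shown in the transcript, so the built-in is an assumption here.

```python
# Edges from 0 up to (but not including) 60, in steps of 25.
bins_to_60 = list(range(0, 60, 25))

# Stopping just past 50 produces exactly the same edges.
bins_to_50 = list(range(0, 51, 25))

print(bins_to_60)  # [0, 25, 50]
print(bins_to_60 == bins_to_50)
```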
If I do that, look what happens.
I want you to notice something--
we stopped at 50.
Bin stopped counting.
We know that there are movies beyond that,
but it's not going to tell us.
That's the first thing to note.
The second thing to note is this 25 to 50 bin, yeah?
And it says that there are 68 elements in it.
Could you go up and look at that 25 to 50 bin?
How many elements are there?
66, it's the same movies.
What is going on?
There is something that has happened here.
And what Bin does is, this last bin
always needs a little bit of extra care.
So what bin does is if there are movies
that are right exactly 50 years old,
it throws them into the last bin.
And so now, if that is true, then looking at this 68,
compared to this 66, how many movies
were exactly 50 years old?
Two-- it better be two.
Well, let's check.
There you go.
Those two movies were exactly 50 years old.
If there are values that are exactly
at the edge of the last bar, those
are thrown into the last bin.
So the last bin typically has a different status
than all the other bins.
It includes both end points.
And this has to be done every time you are binning-- you have
to take care of the end bin.
So all I have done is given you the conventions
that are used by the Bin method for counting and placing
the elements into bins.
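The conventions just described can be collected into one plain-Python sketch. The function `bin_counts` is a hypothetical re-implementation written for illustration and is not the library's code, and the ages are made up so that two of them are exactly 50.

```python
def bin_counts(values, edges):
    """Count values per bin: [left, right) everywhere, except that the last
    bin also includes its right end. Returns (left_end, count) rows, with a
    final row for the last edge whose count is always 0."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # outside all bins: silently not counted
        if v == edges[-1]:
            counts[-1] += 1  # right end of the last bin is included
            continue
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return list(zip(edges, counts + [0]))

# Hypothetical ages, including two that are exactly 50.
ages = [1, 10, 24, 25, 30, 49, 50, 50, 60, 97]

wide = bin_counts(ages, [0, 25, 50, 75, 100])
short = bin_counts(ages, [0, 25, 50])
print(wide)
print(short)
```

With the shorter set of edges, the bin starting at 25 picks up the two values equal to 50, and the values 60 and 97 are dropped without any warning, which are the two effects pointed out above.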
It's merely a decision that was made
by the people who defined the function,
and it's a very reasonable decision.
And such decisions are made conventionally, whether or not
you're using the system.
Some people like to include the right end and not the left,
that's their choice.
Then they have to take care of the first bin.
All right, just a long list of conventions so far.
So this is the method of binning.
So now, what would we want to do?
Well, we would want to say, OK, why don't I
just do my bins again?
So if I do--
so that's a distribution.
And I want to visualize this distribution.
Then I'm going to draw a picture, which
looks awfully like a bar chart.
But before I do that, I have to remember that,
you remember the widths were different?
So I have to be thinking about how that
is going to affect my diagram.
And it is very important, when you
are doing any kind of visualization,
where you're using sort of regions
to represent numerical quantities,
that you are conscious of the area principle.
And let's look at an example.
This is a graphic from Gizmodo.
And it says, the battery size in what was then
the new iPad versus that of the iPad 2,
and the new one was supposed to be 70% larger than the old one.
So let me just remind you what 70% larger means--
if the new one was twice as large as the old one,
it would be 100% larger.
Right?
You have the old one, and another one again.
So 70% larger is not quite that much.
The old one, and much of another one--
70% of an old one. Right?
Now look at those two--
that's supposed to represent the old battery,
and that's supposed to represent the new battery.
I think I can take this small one
and fit at least two of them into the big one.
Do you agree?
Just visually.
So it does not look to me like that is 70% larger than this.
It looks to me like it's way, way, way larger.
All right?
So what has happened here?
What has happened is you are picking up--
your eye is picking up large as area.
That's what you're seeing as big--
not just length, not just width, but area,
which includes both dimensions.
And I believe what has been done in this graphic
is that they have increased both dimensions by 70%.
And so then the area gets multiplied,
and then it's way bigger.
So if you double both sides, your area
will get multiplied by four.
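The arithmetic behind that can be sketched in a few lines of Python. The helper name `area_ratio` is made up for illustration, and the claim that the graphic scaled both dimensions by 70% is the lecturer's guess, not something verified here.

```python
import math

def area_ratio(scale):
    """Factor by which area grows when both width and height are scaled."""
    return scale * scale

print(area_ratio(1.7))  # about 2.89, far more than the intended 1.7
print(area_ratio(2))    # 4

# To make the area itself 1.7 times larger, each side should grow by sqrt(1.7).
side_scale = math.sqrt(1.7)
print(side_scale)       # about 1.30
```

So a figure meant to show "70% larger" ends up looking almost three times as big, which fits the impression that two small batteries would fit inside the big one.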
So that is the thing to avoid.
And the main thing for you to keep in
mind is the area principle-- whatever you are trying
to represent-- if you represent, say, the number 20%
by that one triangle, then representing 40% by those two
triangles is just fine.
That is accurate.
But if you represent 40% by a triangle that
is double both in width and in height, you've messed up.
Because this one has four times the size of that little one
there.
And you can play a little game-- you can take those two
and you can fit it in here, and there are two more
still that you can fit.
So the area principle says that it's
the areas, not the length and the width,
but the areas that should be proportional to the size
that you are trying to represent.
And that's what we are going to use when we draw
what is known as a histogram.
So a histogram is a chart which looks like a bar chart,
but isn't.
It displays a distribution of a numerical variable.
There is a bar corresponding to each bin,
you've already chosen the bins.
And crucially, it uses the area principle--
it is the area of the bar, not the height.
The area of the bar that represents
the percent of individuals in the corresponding bin.
And so we are now going to draw some histograms.
OK, so these are my bins, and these are the interval,
and so for histogram, the method is called Hist.
You specify the numerical variable of interest.
You specify the bins.
And I'm going to use these weird bins that I decided to use.
And you know today, because we're
going to be very interested in exactly what is measuring
what, I'm also going to specify the units of measurement.
So the variable is age, and is measured in years.
So I'm just going to say the unit is year.
And I am going to run this.
So method is Hist.
Required argument, the variable.
And then, optionally, you can specify
your bins and the units.
Looks kind of like a bar graph.
Let's just see what the--
OK, so when you look at a graph, the eye
naturally goes to the vertical.
What I'm going to ask you to do as data scientists
is to discipline yourselves to first start
with the horizontal.
Forget the vertical-- just look at the horizontal axis.
What's happening here?
There are intervals-- these three are skinny,
then they are wider, and so on and so forth.
So this is the number line.
Why has it been broken up like that?
It has been broken up according to these weird bins.
The first three bins have the same size,
and the rest are wider.
All OK?
That's the horizontal axis.
It is drawn to scale.
These three bins, of width five, they're all equal.
This is of width 10, it's double this one.
You don't have the option of making this bin the same width
as this one.
There are numbers-- so it's twice as wide.
And then there is a rectangular bar
corresponding to the percent of movies in each bin.
And let's ignore the vertical axis,
I just want you to look at a few numbers.
This is the five to 10 bin, and this is the 65 and up bin.
So in the five to 10 bin, there are 17 movies.
In the 65 and up bin, there are 15 movies, about the same.
Do you see that this rectangle looks
very much like this rectangle?
Maybe it's a little bit bigger.
The one to the right is on its side.
But you see that the areas are roughly the same?
Yes?
All right, so these bars are representing the areas
in a natural way--
It's the way that we want them to be represented.
But then there's this weirdness here, percent per year.
And people are going, what?
Why percent per year?
Why not just plot the counts?
Just plot how many movies there are in each bin.
So I am going to do that, and it is
going to give me the heebie jeebies,
and I'm going to freak out.
And then I'm going to quickly redraw the diagram properly.
But I'm going to do what people want done,
which is just draw the counts.
Why are you fussing with the vertical axis?
To do that, I need the same call.
What we've done is that this histogram has been normalized
so that the total area is 100%.
It is following the area principle.
But you can say, you know what?
I don't want any of that.
Don't normalize it.
Just give me the counts--
don't do percents, just give me counts.
And now you have that thing.
All right?
I'm not even going to face the back, because it's just, ugh.
Why am I freaking out in this way?
I am freaking out in this way, because I
am looking at this thing that is supposed to represent
15 movies, and comparing to this thing that is supposed
to be representing 17 movies--
and it is just wrong.
This is the reason we don't plot the counts-- the bars are
uneven.
And what your eye is picking up as a big bar,
is a bar with high area.
If you forget that the widths are all different,
and just plot the counts, you will get weirdnesses like this.
So that does not represent, visually, 15 and 17, at all.
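A quick calculation shows how far off the count-as-height picture is. This sketch uses the two bins compared above, 5 to 10 with 17 movies and 65 to 100 with 15 movies; the function name is made up for illustration.

```python
def area_when_height_is_count(left, right, count):
    """Area the eye sees if the bar's height is set to the raw count."""
    return (right - left) * count

a_young = area_when_height_is_count(5, 10, 17)    # 5 wide, 17 tall
a_old = area_when_height_is_count(65, 100, 15)    # 35 wide, 15 tall

print(a_young, a_old)
print(a_old / a_young)  # the bar with fewer movies is drawn over 6 times bigger
print(15 / 17)          # the ratio a faithful picture should show
```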
And so just to make myself feel better, yes, thank you.
This thing is called a histogram.
And the area of each bar is the percent
of individuals in the bar.
Yes?
STUDENT: [INAUDIBLE]
PROFESSOR: So why do we have uneven bars?
That is a good question.
It was my choice.
Usually I use uneven bars when I am less interested
in detail in some regions, and more interested
in some other regions.
So if, for example, I was looking at incomes,
and I was interested in low to middle income,
I would probably have a lot of bars there.
And then beyond that, I didn't care so much,
then I would have wide bars.
All right, this figure is called a histogram.
We know that its horizontal axis is the number line--
it is drawn to scale.
And so you can't have arbitrary gaps in between the intervals,
either.
And the areas represent the percents-- let me
just quickly-- so there's a question of, then,
how did it figure out the heights?
We'll talk about that for the rest of the lecture.
OK, so the axes--
the total area sums to 100%.
I haven't left anybody out, it's everybody.
And we are using percents--
the area of each bar is a percent.
The area represents the percent, that's the area principle.
Horizontal axis is the number line.
The vertical axis-- now what's the vertical axis?
It's a rate.
Let's take a look.
Actually, before I take a look, questions thus far.
Yeah?
STUDENT: [INAUDIBLE]
PROFESSOR: What's the default for bins if we don't set it?
That?
It chooses 10 bins of equal size.
I think it goes min to max, and just divides by 10.
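If the default is indeed ten equal bins from the minimum to the maximum, which the answer above gives only tentatively, the edges would be computed roughly like this. The helper `equal_width_edges` is hypothetical and is not the library's code.

```python
def equal_width_edges(values, n_bins=10):
    """Edges of n_bins equal-width bins spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

# The movie ages run from 1 to 97, so each default bin would be 9.6 years wide.
edges = equal_width_edges([1, 97])
print(edges)
```

Ten bins need eleven edges, which is why the list has one more entry than the number of bins.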
All right, so-- you've got a bar, it's a rectangle.
Yes?
So area is width times height.
You agree?
All right, so for all these bars, the width
we know-- let me actually redraw the histogram.
I know the widths because I set them.
I know the areas, because that was counted by Bin.
I need to figure out the height, and you'll agree that--
yes?
Area is equal to height times width-- so how
do you figure out the height?
You take the area, and you divide by the width.
That's what's happened.
So let's take a look at the distribution again.
So I'm just binning, I'm not doing a histogram.
Bins equals, I was calling them My Bins.
OK, there's the count.
What I'd like to do is I would like to look--
we've got to pick some bar to look at.
Let's look at this one.
The 40 to 65 bar--
how many movies in that bar?
52-- you agree?
Total movies was 200.
So let's start doing in 40, 40 to 65--
no, I don't want a hard bracket there, but bin.
There are 52 out of 200 movies.
So percent is 52 divided by 200.
And I actually want a percent--
so how about we multiply by 100.
Looking good?
And so percent is 26%, which you have all done in your head.
The width of the interval is right endpoint minus left endpoint.
So 65 minus 40.
So percent area divided by width should give me 26 over 25,
and that is 1.04.
Should we look at that bar?
This is this bar.
The 40 to 65 bar is right here.
You see it's just a smidgen over 1, the height, that's 1.04.
OK, so I will stop now, and I will take questions
about the drawing of the height--
about the calculation of the height-- and then
we're going to interpret it.
OK.
Somewhere in your notes, or your phone, or whatever--
wherever you are taking a record of this--
please put a note, remember how to calculate histogram heights.
Because the temptation is to put the count, or the percent
there, and it ain't so-- you have to divide by the width.
And then, of course, we have to look at units,
which we will do right now.
You can see that this is--
wait a minute, what happened here?
OK, the calculation that we did is 26% divided by 25 years.
Which is why the unit is percent per year.
For each year, there is 1.04% of the data in that bar.
So that's why percent per year here, and that just simply
tells you why those words are written there.
We will interpret what percent per year
means as a physical thing in just a moment.
So to find the height of the second bar,
I think it had 18 movies in it.
Maybe it was 17, let me see.
Where are my bins?
OK, the 10 to 15 bar had 18 movies in it.
That's what percent of 200?
9%.
What's the width of the bar?
Five units, yes?
10 to 15.
So 9 divided by 5, the height of the bar should be 1.8,
and that's what you're seeing.
That's the 1.8 right there.
So 1.8% per year.
So that's how you calculate the heights.
That is a summary of the calculation that we just did.
OK, I've just written out what we did in the notebook.
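
The same summary can be sketched as a small reusable function. The name `bar_height` and its parameters are hypothetical, chosen only for illustration. Only the two bins whose counts were worked through above are included, since the counts of the other bins were not stated.

```python
def bar_height(count, total, left, right):
    """Return the histogram height of one bin, in percent per unit."""
    percent = 100 * count / total   # area of the bar, in percent
    width = right - left            # width of the bin
    return percent / width          # area divided by width

# Bins whose counts were read off in lecture: (left, right, count).
known_bins = [(10, 15, 18), (40, 65, 52)]

for left, right, count in known_bins:
    height = bar_height(count, 200, left, right)
    print(f"{left} to {right}: {round(height, 2)} percent per year")
```

Run on the two bins above, this reproduces the heights of 1.8 and 1.04 percent per year.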
OK, so now what does height measure?
Percent per year-- percent is the area,
the amount of stuff in the bar.
Year is what's happening along the horizontal axis.
What you need to do now is to think of your bar
and look at the bin, and look at individual years
within the bin.
That height of 1.04 is telling you
that in each individual year, there is, on average,
1.04% of the data--
1.04% per year in that bar.
So it's the number of movies in the bin, relative to the size
of the bin.
And so therefore, it is called a density.
It is not measuring how many movies there are,
but rather how crowded the movies are in the bin.
Let's just go back and take a look at--
let's see.
OK, 40 to 65 bin has 52 movies in it.
The 25 to 40 bin-- the one to the left-- has 40 movies in it.
52 movies, 40 movies-- you notice, fewer movies,
but taller.
Yeah, that is your signal that the height is not
counting how many movies.
It's counting something else.
It's counting how many movies per unit space here.
And so while this bin does have more movies,
it has correspondingly even more space.
So those movies have more elbow room--
you just imagine lining them up here,
they just have more elbow room.
So they're less crowded.
And so the height is lower.
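
The "fewer movies, but taller" comparison can be checked numerically. This is a rough sketch using the two counts quoted above; the dictionary layout and names are illustrative assumptions.

```python
total_movies = 200

# (left, right, count) for the two neighboring bins discussed above.
bins = {"25 to 40": (25, 40, 40), "40 to 65": (40, 65, 52)}

density = {}
for label, (left, right, count) in bins.items():
    percent = 100 * count / total_movies        # area of the bar
    density[label] = percent / (right - left)   # percent per year
    print(label, "count:", count, "height:", round(density[label], 2))

# The bin with fewer movies is the taller one, because it is narrower.
fewer_but_taller = (bins["25 to 40"][2] < bins["40 to 65"][2]
                    and density["25 to 40"] > density["40 to 65"])
print(fewer_but_taller)
```

The 25 to 40 bin holds 20% of the movies in 15 years, about 1.33 percent per year, while the 40 to 65 bin spreads 26% over 25 years.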
OK, area measures percent-- we've said that over and over again.
If you want to discuss how many individuals there are in a bin,
you're going to look at area.
If you want to discuss how crowded is the bin,
where is the most action per unit length
on the horizontal axis?
Then you are going to look at height.
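
Going in the other direction, from height back to area, recovers how many individuals are in the bin. This is a minimal sketch using the 40-to-65 bar; the variable names are illustrative.

```python
height = 1.04          # percent per year, read off the 40-to-65 bar
width = 65 - 40        # years
total_movies = 200

percent_in_bin = height * width                       # area of the bar
movies_in_bin = percent_in_bin / 100 * total_movies   # convert percent to a count

print(round(percent_in_bin, 2), round(movies_in_bin))   # prints: 26.0 52
```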
I want to leave you with a comparison.
Which do you draw?
You've got a distribution, and you're trying
to draw a picture that represents it-- which
do you draw?
A bar chart or a histogram?
Bar chart-- categorical variable.
The bars can have arbitrary widths,
because nothing is numerical there.
You get to choose.
You can have spacings, you get to choose.
And the height or the length, if you are doing it horizontally,
is proportional to how many you have in that category.
Histogram-- numerical variable. Because the variable is numerical,
the horizontal axis must be drawn to scale.
No gaps.
And you are allowed to have unequal bins--
that's the key.
You follow the area principle, and calculate the heights
accordingly.
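
The area principle with unequal bins can be sketched in plain Python. The values and bin edges below are made up for illustration and are not the movie data from lecture. The convention that a bin includes its left end but not its right end is also an assumption.

```python
# Hypothetical values and bin edges, made up for illustration only.
values = [3, 7, 12, 14, 18, 22, 27, 33, 41, 58]
edges = [0, 10, 15, 25, 40, 65]   # unequal widths are allowed

heights = []
for left, right in zip(edges[:-1], edges[1:]):
    # Assumed convention: a bin includes its left end but not its right end.
    count = sum(1 for v in values if left <= v < right)
    percent = 100 * count / len(values)        # area of the bar
    heights.append(percent / (right - left))   # height = area / width

# Area principle: the bar areas (height times width) add up to 100 percent.
total_area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))
print([round(h, 2) for h in heights], round(total_area, 2))
```

Every bin here holds the same 20% of the values, yet the heights differ, because the widths differ.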
All OK?
Have a great weekend.
I'll see you Monday.
