Wednesday, February 21, 2018

Watching daily Feb 22 2018

PROFESSOR: Morning, everybody.

Thank you for being here.

Happy Friday.

Today is a lecture on histograms.

Before that, the usual announcements,

of which there really aren't any other than--

well, I'm looking at that page, and it's not yet updated.

But homework two was due yesterday,

so that's not really an announcement.

Homework three will be released today,

and then it's the usual Thursday deadline,

but you get a point for turning it

in on Wednesday-- nothing new there.

Next Friday-- next Friday is special.

Next Friday, we will release project one--

or at least we'll discuss the release of project one.

And it's even more special because one

of my fellow instructors is going

to be here to give that lecture. John DeNero

will be lecturing on Friday.

And he is a legend, you should absolutely hear him.

I've had more years teaching than him,

but every time I watch him I learn something new.

So please show up.

All right, that was announcements, short and sweet.

Here's the plan for today--

You now have many table methods. You

know how to use data in tables in quite varied ways.

Today, we are going to talk about distributions--

I will define what that means in just a moment--

and we'll talk about visualizing them,

both numerical and categorical.

And so before we get started on those, let's

just remind ourselves of some terminology.

Types of data-- we discussed this briefly last time--

data can be of many types.

Two common types are numerical and categorical.

So numerical is just data that are numbers--

heights, weights, temperatures, number of students in a class,

all of these.

And of course, being numbers, they

are subject to the rules that govern numbers.

They have relations to each other--

15.2 is bigger than 13, it's not less, and so on.

Categorical is another common type where the data aren't

numbers, they are categories.

So it could be your favorite color, ethnicity,

which year you are in school--

so freshman, sophomore, junior, senior.

And let's see-- not satisfied, somewhat satisfied, highly

satisfied, OK, those are all categorical.

Now that last one has a natural ordering,

as does freshman, sophomore, junior, senior.

But vanilla, chocolate, strawberry doesn't necessarily

have an ordering, unless you are very particularly

interested in one of them.

So these are two common types of data.

There are many other types of data-- data can be maps,

they can be music, they can be anything.

But these are two common types.

Yes?

STUDENT: Are data like zip codes numerical or--

PROFESSOR: Are data like zip codes numerical or categorical?

That is an excellent question.

Let's try to answer it.

So if I look at zip code 94720, that's a number,

it's an integer.

Yes?

But then let's look at another zip code--

94704, that's also a number, yes?

Does it make sense to take the difference of the two?

Well, you can subtract, you'll get another number,

but that number doesn't have an interpretation.

So in some sense, they are just labels.

And it's a short way of saying, it's this region in Berkeley.

So that is a good point that some data that

look numerical are actually categorical,

and we saw that last time with-- you know,

there was sex was classified as zero, one, two, or something

like that.

OK, so some terminology that we've been using throughout.

Individual is a unit whose features are being recorded.

A variable, we've used several words--

variable, feature, attribute.

And across different individuals,

a variable has different values.

So if the variable is favorite color, then

across different people, the value of the color

will be different.

Variables can be numerical or categorical, with subtypes

within these, as we have discussed.

And that should say, and many other types besides.

But we are going to focus, especially today,

on numerical and categorical variables.

The key is, when you look at an individual,

the categories are defined so that the individual fits

in one and only one category.

No blurring, no overlapping, that

makes counting and organization complicated.

OK. So what's a distribution?

What I'd like you to imagine is the individuals

are all the students in this class.

The variable is their year in school--

so freshman, sophomore, junior, senior,

why don't we call graduate students just graduate.

So the variable is the year in school,

the values of the variable are freshman, sophomore,

junior, senior, graduate.

Distribution takes each value, like freshmen,

and counts how many students correspond to that value.

So how many freshmen, how many sophomores,

how many juniors, how many seniors,

how many graduate students?

It says how the class is distributed over the values

of the variable.

So in a distribution, each person

should appear once and only once in the categories,

and everybody should appear.
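
A minimal sketch of that idea in plain Python, assuming a made-up roster (the list `years` is hypothetical, not real class data). It counts how many students fall under each value of the variable and checks that everybody is counted exactly once.

```python
from collections import Counter

# Hypothetical roster: one year-in-school value per student.
years = ["freshman", "sophomore", "freshman", "junior", "senior",
         "graduate", "sophomore", "freshman"]

# The distribution: each distinct value paired with how many students have it.
distribution = Counter(years)

# Each student sits in exactly one category, so the counts add up to the class size.
total = sum(distribution.values())
print(distribution)
print(total == len(years))
```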

And so what we're going to do is we're

going to start with a categorical distribution--

a distribution of a categorical variable.

And we are going to visualize it.

And you have seen these visualizations

since you were little, we're just giving names to things

that you know.

And we're going to start out with a bar chart.

And I don't know what it is, when

we developed this material, we must have

been in some real movie kick.

OK, so these are the 200 top grossing movies

up to the year 2017.

And let me be clear about what top grossing means--

it's how much ticket revenue there was.

Now, you can see that Gone with the Wind is a pretty old movie,

it was released in 1939.

Ticket prices were nice and low at that stage--

so comparing the total dollar amount

that was taken in for tickets in 1939

to the total dollar amount that is taken in by a movie

now, when we're all paying upwards of $10

to go see a movie, is not fair.

So that column, gross, is number of dollars of tickets

that was sold in that year.

The column gross adjusted is that same number of tickets,

but sold at 2017 prices.

OK, so basically it's a comparison

of how many tickets were sold, and what the ticket price was.

And you can see that on that scale,

Gone with the Wind just blows everybody

else out of the water.

So the rows have been sorted in decreasing order

of the adjusted gross amount.

I'll pause for a second, I'll let you take a look.

Questions about what the rows represent?

All right, so the individuals are movies, and there is studio--

which is a categorical variable--

there are gross and gross adjusted,

which are numerical variables.

The year, you can think of as a label,

but also, it's reasonable to do some arithmetic on them.

So you can take a difference of years,

and you get the amount of time in between.

So we will, today, think of it as a numerical variable.

What I want to do is I want to look

at this column called studios.

So for each movie, it's the studio that released the movie.

And you can see that Fox appears twice,

Paramount appears twice, Disney, surely,

if you look at other movies it's going to appear more than once.

And so that variable, studio, has a distribution.

There are certain distinct studios,

and there's a count of how many movies came from each studio,

and we're going to try to first get hold of that distribution.

So what we'll do is just work with the studios.

Each movie corresponds to one studio,

that's the same as saying each row corresponds to one studio.

And I just want to see, for each studio,

how many times did it appear in this data set?

And there's a method that does that.

It's called Group.

And here's what Group does--

this is just an introduction to Group,

next week, we'll do Group in all its detail.

It takes all the rows, groups them by the distinct value--

so it'll put everything that says Fox in one lot,

and then count how many there are.

So Group did the counting for you.

It said, the studios are ABC, old Buena Vista--

that's a Disney thing--

Columbia, Disney, Dreamworks, you

can see that it's put them in alphabetical order.

One of the movies was released by ABC, 35

by Buena Vista, nine by Columbia,

and so on-- it's done the counting for you.

Let me give this thing a name.

All right, so that's fine.

So now we can see at one go what were all the distinct studios,

and how many movies came from each studio.
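
What the grouping step does can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper `group` and the list `studios` are hypothetical stand-ins, and the studio entries are made up rather than taken from the movie table.

```python
from collections import Counter

def group(values):
    """Return (value, count) rows, one per distinct value, sorted by value."""
    return sorted(Counter(values).items())

# Hypothetical studio column, one entry per movie (not the real data set).
studios = ["Fox", "Paramount", "Fox", "Disney", "Paramount", "MGM", "Fox"]

studio_counts = group(studios)
for studio, count in studio_counts:
    print(studio, count)
```

The rows come out in alphabetical order of the studio name, which matches what the lecture observes about the grouped table.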

But that is a little hard to see,

so it's a great idea to draw a picture.

And what we're going to do is we're going to take this,

and we're going to draw a bar chart.

Why do I keep typing a dot?

I should know better.

OK bar H, because it is going to be a horizontal bar

chart, for reasons I will explain in just a moment.

Here's what you have to give bar H. You have to say,

OK, these guys are all the distinct values

of the categorical variable.

This is the variable I want summarized, so give it

the label of the column.

And you say, use these as the categories,

and these to help you draw the length of the bars.

So you just give it the name of your categorical variable.

And it will use the other as the length of the bar.

And if I do this, I get that.

So this says however many studios it has,

and a bar for each studio, and it's

a way of getting a visualization of that distribution

that we had.

And a couple of things to note here--

a horizontal bar graph because, if you had it vertical,

labeling is difficult. The names of the studios

run into each other.

The names of the studios are long,

and so by the time you have Paramount and Paramount

Dreamworks, putting them next to each other,

they'll just overlap.

So horizontal bar chart.

OK, all right, important thing to note--

on the horizontal axis, you have a numerical variable.

It's the count-- it's how many movies were released.

On the vertical axis, you don't have numbers,

you have categories.

That means one thing for a start.

There is no particular order to these.

They are in alphabetical order just because that's what Group did.

But there is no reason for them to be in alphabetical order.

In fact, putting them in alphabetical order

makes the graph difficult to understand.

What's the natural thing to do here?

Sort-- sort by what?

Sort by descending, or in descending order

of number of movies.

So if we go back and take our table,

and sort by the counts column--

that is a much easier to interpret graph.

So, I mean, you can see that Buena Vista had

the most, and then Warner Brothers,

and so on, and so forth.
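
As a rough sketch of sorting before drawing, here is a plain-Python version that prints a text bar chart. The rows in `studio_counts` are made up for illustration; the point is only that categories can be reordered freely, so sorting by count is allowed.

```python
# Hypothetical (studio, count) rows in alphabetical order.
studio_counts = [("Disney", 1), ("Fox", 3), ("MGM", 1), ("Paramount", 2)]

# Sort by count, largest first. Categories have no inherent order, so this is fine.
by_count = sorted(studio_counts, key=lambda row: row[1], reverse=True)

# One horizontal bar per category; the bar length is the count.
for studio, count in by_count:
    print(f"{studio:<10} {'#' * count}")
```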

Crucial thing to note is we were able to permute--

that is rearrange the bars--

because they are categorical.

There is no natural ordering.

With numbers, you can't do this.

You can't suddenly put five below three.

The other thing to note is these bars have widths.

And I don't know how many of you have thought

about the width of bars in bar charts,

but I think you understand now once you

are thinking about them that it's

entirely up to the designer.

There's no numerical reason for the bar

to be of a certain width--

the width is usually decided based

on appearance, how much fits on a screen,

and so on and so forth.

So it's basically a designer's choice, what the width is.

So I also noticed there are gaps between the bars.

How wide those gaps are is also designer's choice.

And it makes sense that all the bars have the same width,

and all the gaps are the same, because otherwise, you

give more emphasis to one rather than the other.

And that's so silly.

And why am I noticing these rather obvious things?

Because when you get to numbers, you

don't have any of these choices.

And you have to make your choices that

are consistent with the order of the numbers.

All right, questions thus far?

OK.

When we get to numerical variables,

there is one thing that happens immediately.

So now, imagine that your variable is height.

So one person is 68.3 inches tall,

somebody else is 62.5 inches tall,

somebody else is 72.1 inches tall, and so on.

So there is a value for every individual,

just as there was a value for every individual here--

every movie came from a studio.

Now, it's very easy to assign the movie to the right group--

you just use the name of the studio,

you stick it into that group.

With numerical variables, you might not

want the level of detail that says,

how many people were 68.3 inches tall?

How many people were 68.4 inches tall?

How many people were 68.5 inches tall?

You might want to just say, OK, look,

I'm going to look at how many people are between 68

and 70 inches tall, and then 70 and 72, and 60 and 65,

whatever.

Right?

So that choice is up to you.

Those are called bins--

you put the values in bins.

So unlike categorical variables, for numerical variables,

you have to start with what is called binning.

That is a summary of what we just

did for the categorical variable--

you've got a distribution, which you get by using Group.

And you display a bar chart, one bar for each category.

And the length of the bars is the number

of individuals in the category.

Numerical variable, you start with binning.

And so we have to look at binning

in a little bit of detail, so let's do that now.

If you bin numerical values, what you're doing

is each bin is a range from here to there,

and you are counting the number of individuals in the range.

And always, there is an issue about the endpoints--

what happens to the people at the end points?

And so there are conventions that are used.

And the convention that we are going to use

is consistent with Python's convention,

that every range includes the left end, but not the right.

So if that's a data set, and I have decided to use these bins,

how do you start binning those values?

Well, 188 goes there.

And I want you to look at that notation-- that's

the 185 to 190 bin.

Do you see the square bracket at the 185?

And the parentheses at the 190?

Yes?

That is math code for the 185 is actually

included in that interval, but 190 is not.

So often you will see closed at the left end,

open at the right.

All right, so the next one is 170.

Could you please talk to your neighbor

and figure out where is it going to go?

Is it going to go to the left of 170,

or is it going to go to the right of 170?

Quickly.

So here, or here?

All right, votes for here?

Votes for here?

All right, and there are some people who

just will never vote at all.

Yeah, OK, so you're doing well.

170 is there, because the 170 to 175 interval includes the 170.

The 165 to 170 interval does not include the 170,

by our convention.

You can have the opposite convention if you wish,

but the methods that you are going to be using here

in the Data Science Library use this convention,

it's consistent with Python.
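
The left-closed, right-open convention can be checked with a small plain-Python sketch. The bin edges in `left_ends` are an assumption (bins of width 5 from 165 to 190, suggested by the intervals mentioned above), and `bin_of` is a hypothetical helper, not a library function.

```python
from bisect import bisect_right

# Assumed left endpoints of bins of width 5: [165, 170), [170, 175), ..., [185, 190).
left_ends = [165, 170, 175, 180, 185]

def bin_of(value):
    """Return the left endpoint of the bin containing value (assumes 165 <= value < 190)."""
    index = bisect_right(left_ends, value) - 1
    return left_ends[index]

print(bin_of(188))  # falls in the bin that starts at 185
print(bin_of(170))  # falls in the bin that starts at 170, not the one that ends there
```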

OK, so here we go.

Basically stacking the bricks, one above the other.

And can you see something that's awfully like a bar chart?

That is emerging very much like a bar chart,

and so we have to be able to just do this in general,

without us having to go through one at a time, counting values.

Agreed?

All right, so we are going to now-- the first thing we're

going to do is we are going to get

Python to do the binning for us, so we

don't have to go through this.

And then we'll see how to do the visualization.

So what I like to do is to construct

a numerical variable-- that is the ages of the movies.

Let me remind you what the data set looks like.

So you've got the year in which they were released,

so we'll just take 2018, subtract off the year,

and we'll get the age.

And what I'd like to do is I'd like

to look at the distribution of the ages of these movies.

So I get to choose bins.

And so, in order to choose the bin,

it's a great idea to know which was the oldest one,

and which was the newest one.

Or not which one, but how old was

the youngest and the oldest.

So if I run that--

the youngest one was one year old.

What was the big, huge movie that was released last year,

do you think?

Yeah, Star Wars, it's always Star Wars.

All right, and then one of them is 97 years old.

So I chose some bins.

OK, so bins are specified as an array.

An array of what?

An array of the left end points.

So zero to five, five to 10, 10 to 15, 15 to 25, et cetera.

100 is a curiosity--

I'm ending at 100.

What happens beyond 100, we'll discuss in just a moment.

But when you see bins as an array, it should be increasing.

And they represent the left endpoints

of all your intervals.

All right, so let's bin our movies--

top dot, and the method is called Bin.

You have to say you are binning according to what?

We are binning according to the age column.

And what are the bins?

They are in the array that I have called My Bins.

So two arguments-- the label of the column,

which is your variable, and the array that are your bins.

What have I done?

Did I not run my bins?

I probably didn't run my bins.

Yep-- OK.

That's the difficulty of having typed your code in ahead

of time.

All right, so what happened here?

We need to learn to read this table.

That first line is not a line by itself--

it actually includes the five of the second line.

That first line says, for the bin whose left end is zero,

there are 21 movies in that bin.

All right, if the left end is zero, where will the right end be?

Five, right?

So the ages that are counted in the first bin

are zero, one, two, three, four--

not five.

Five is counted in the next one.

OK so far?

And so if you go down, that's the whole lot.

And there is this curiosity at 100--

I just want to make sure that--

there were 200 movies in all.

What should the answer be?

What am I adding up?

Which column?

Age count, yes?

If I add them up, what should I get?

Every movie appears once and only once in there.

I should get the number of movies.

There were 200 movies in all, let's see.

Good.

So what you're seeing is that this is a distribution.

Every movie appears once, and exactly once in there.

OK, this 100 you should think of as the right end

of the last bar, 65 to 100.

That zero says, I'm not counting anything beyond that.

In fact, there is nothing beyond that, we know.

The largest one is 97.

So typically, your last endpoint will be chosen in that way,

slightly beyond your last one.

And somebody is going to ask, well, what

if there were something at 100?

We'll see in just a moment.

So I chose my bins-- notice, it's up to me

to choose my bins.

I chose my bins so that they're not equally wide--

you see that?

Zero to five is five years wide, five to 10 is five years wide,

10 to 15 is five years wide, but then, 15 to 25

is 10 years wide.

And 25 to 40 is 15 years wide, and so on and so forth.

And with numerical data, bins don't have to be equal.

And this is the crucial thing that

distinguishes the display of numerical data distribution

from a bar chart.

Bins do not have to be equal.

And you have to take that into account when we do the display,

as you will see.

But you could have the bins to be equal.

You can specify the bins to be sequential-- a range of values

separated by a step.

How about we do zero--

I just have bins of width 25.

So I did a recount--

zero to 25, not including 25, 25 to 50,

not including 50, et cetera, et cetera.

And you have another count, you can totally

do that, no problem.

Some of you will be wondering at this point,

why are we doing this fussing at the edge?

Why don't we just say zero to 24?

25 onwards?

It is because numbers don't have to be whole numbers.

So you can have arbitrarily many decimal places,

and then you will get very, very close to edges with your data.

And so this is a way of just taking

care of all kinds of numerical variables at once.

All right, what I want to do is something

that there is really no natural reason for doing,

but if you did it--

so notice that I've always gone beyond the range of the data.

What if, instead, for some reason known only

to the gods of weirdness, I stopped at 60.

Then that range would be--

I would have zero, I would have 25,

I would have 50, and no more.

Agreed?

So it's actually, because of the step,

it's the same as stopping at 50.
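
The effect of the step can be seen with Python's built-in `range`, used here as a stand-in for whatever array function the notebook uses (that choice is an assumption). The end value is never included, so stopping at 60 with a step of 25 produces nothing past 50.

```python
# Equally spaced edges that go past the oldest age, 97.
equal_bins = list(range(0, 101, 25))

# Stopping at 60 with a step of 25 gives the same edges as stopping at 50.
short_bins = list(range(0, 60, 25))

print(equal_bins)
print(short_bins)
```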

If I do that, look what happens.

I want you to notice something--

we stopped at 50.

Bin stopped counting.

We know that there are movies beyond that,

but it's not going to tell us.

That's the first thing to note.

The second thing to note is this 25 to 50 bin, yeah?

And it says that there are 68 elements in it.

Could you go up and look at that 25 to 50 bin?

How many elements are there?

66, it's the same movies.

What is going on?

There is something that has happened here.

And what Bin does is, this last bin

always needs a little bit of extra care.

So what bin does is if there are movies

that are right exactly 50 years old,

it throws them into the last bin.

And so now, if that is true, then looking at this 68,

compared to this 66, how many movies

were exactly 50 years old?

Two-- it better be two.

Well, let's check.

There you go.

Those two movies were exactly 50 years old.

If there are values that are exactly

at the edge of the last bar, those

are thrown into the last bin.

So the last bin typically has a different status

than all the other bins.

It includes both end points.

And this has to be done every time you are binning-- you have

to take care of the end bin.

So all I have done is given you the conventions

that are used by the Bin method for counting and placing

the elements into bins.
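
These conventions can be written out as a short plain-Python sketch. The function `bin_counts` is a hypothetical stand-in, not the library's code, and the ages are made up; it follows the rules described above: every bin includes its left end only, the last bin also includes its right end, and values beyond the last edge are not counted.

```python
def bin_counts(values, edges):
    """Count values per bin. Bins are [left, right), except the last, which is [left, right]."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = edges[i], edges[i + 1]
            if left <= v < right or (i == last and v == right):
                counts[i] += 1
                break
    return counts

ages = [1, 12, 25, 30, 49, 50, 50, 97]  # hypothetical ages, two of them exactly 50

stopped_at_50 = bin_counts(ages, [0, 25, 50])
full_range = bin_counts(ages, [0, 25, 50, 75, 100])
print(stopped_at_50)
print(full_range)
```

In this toy data the 25 to 50 count is larger by two when 50 is the final edge, for the same reason as the 68 versus 66 in the lecture: the two values exactly at 50 are thrown into the last bin.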

It's merely a decision that was made

by the people who defined the function,

and it's a very reasonable decision.

And such decisions are made conventionally, whether or not

you're using the system.

Some people like to include the right end and not the left,

that's their choice.

Then they have to take care of the first bin.

All right, just a long list of conventions so far.

So this is the method of binning.

So now, what would we want to do?

Well, we would want to say, OK, why don't I

just do my bins again?

So if I do--

so that's a distribution.

And I want to visualize this distribution.

Then I'm going to draw a picture, which

looks awfully like a bar chart.

But before I do that, I have to remember that,

you remember the widths were different?

So I have to be thinking about how that

is going to affect my diagram.

And it is very important, when you

are doing any kind of visualization,

where you're using sort of regions

to represent numerical quantities,

that you are conscious of the area principle.

And let's look at an example.

This is a graphic from Gizmodo.

And it says, the battery size in what was then

the new iPad versus that of the iPad 2,

and the new one was supposed to be 70% larger than the old one.

So let me just remind you what 70% larger means--

if the new one was twice as large as the old one,

it would be 100% larger.

Right?

You have the old one, and another one again.

So 70% larger is not quite that much.

The old one, and much of another one--

70% of an old one. Right?

Now look at those two--

that's supposed to represent the old battery,

and that's supposed to represent the new battery.

I think I can take this new one and fit at least two of them

into that old one.

I'm sorry, I think I can take this small one,

and fit at least two of them into the big one.

Do you agree?

Just visually.

So it does not look to me like that is 70% larger than this.

It looks to me like it's way, way, way larger.

All right?

So what has happened here?

What has happened is you are picking up--

your eye is picking up large as area.

That's what you're seeing as big--

not just length, not just width, but area,

which includes both dimensions.

And I believe what has been done in this graphic

is that they have increased both dimensions by 70%.

And so then the area gets multiplied,

and then it's way bigger.

So if you double both sides, your area

will get multiplied by four.

So that is the thing to avoid.
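
The arithmetic behind this is short enough to check directly. In the sketch below, `area_factor` is a hypothetical helper that gives the factor by which a rectangle's area grows when both of its sides are scaled by the same amount.

```python
def area_factor(scale):
    """Factor by which area grows when width and height are both multiplied by scale."""
    return scale * scale

doubled = area_factor(2)        # doubling both sides
seventy_more = area_factor(1.7) # increasing both sides by 70%

print(doubled)
print(round(seventy_more, 2))
```

Scaling both sides by 1.7 multiplies the area by about 2.89, which is why the larger picture looks nearly three times the size rather than 70% larger.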

And the main thing for you to keep in

mind is the area principle-- whatever you are trying

to represent-- if you represent, say, the number 20%

by that one triangle, then representing 40% by those two

triangles is just fine.

That is accurate.

But if you represent 40% by a triangle that

is double both in width and in height, you've messed up.

Because this one has four times the size of that little one

there.

And you can play a little game-- you can take those two

and you can fit it in here, and there are two more

still that you can fit.

So the area principle says that it's

the areas, not the length and the width,

but the areas that should be proportional to the size

that you are trying to represent.

And that's what we are going to use when we draw

what is known as a histogram.

So a histogram is a chart which looks like a bar chart,

but isn't.

It displays a distribution of a numerical variable.

There is a bar corresponding to each bin,

you've already chosen the bins.

And crucially, it uses the area principle--

it is the area of the bar, not the height.

The area of the bar that represents

the percent of individuals in the corresponding bin.

And so we are now going to draw some histograms.

OK, so these are my bins, and these are the interval,

and so for histogram, the method is called Hist.

You specify the numerical variable of interest.

You specify the bins.

And I'm going to use these weird bins that I decided to use.

And you know today, because we're

going to be very interested in exactly what is measuring

what, I'm also going to specify the units of measurement.

So the variable is age, and is measured in years.

So I'm just going to say the unit is year.

And I am going to run this.

So method is Hist.

Required argument, the variable.

And then, optionally, you can specify

your bins and the units.

Looks kind of like a bar graph.

Let's just see what the--

OK, so when you look at a graph, the eye

naturally goes to the vertical.

What I'm going to ask you to do as data scientists

is to discipline yourselves to first start

with the horizontal.

Forget the vertical-- just look at the horizontal axis.

What's happening here?

There are intervals-- these three are skinny,

then they are wider, and so on and so forth.

So this is the number line.

Why has it been broken up like that?

It has been broken up according to these weird bins.

The first three bins have the same size,

and the rest are wider.

All OK?

That's the horizontal axis.

It is drawn to scale.

These three bins, of width five, they're all equal.

This is of width 10, it's double this one.

You don't have the option of making this bin the same width

as this one.

There are numbers-- so it's twice as wide.

And then there is a rectangular bar

corresponding to the percent of movies in each bin.

And let's ignore the vertical axis,

I just want you to look at a few numbers.

This is the five to 10 bin, and this is the 65 and up bin.

So in the five to 10 bin, there are 17 movies.

In the 65 and up bin, there are 15 movies, about the same.

Do you see that this rectangle looks

very much like this rectangle?

Maybe it's a little bit bigger.

The one to the right is on its side.

But you see that the areas are roughly the same?

Yes?

All right, so these bars are representing the areas

in a natural way--

It's the way that we want them to be represented.

But then there's this weirdness here, percent per year.

And people are going, what?

Why percent per year?

Why not just plot the counts?

Just plot how many movies there are in each bin.

So I am going to do that, and it is

going to give me the heebie jeebies,

and I'm going to freak out.

And then I'm going to quickly redraw the diagram properly.

But I'm going to do what people want done,

which is just draw the counts.

Why are you fussing with the vertical axis?

To do that, I need the same call.

What we've done is that this histogram has been normalized

so that the total area is 100%.

It is following the area principle.

But you can say, you know what?

I don't want any of that.

Don't normalize it.

Just give me the counts--

don't do percents, just give me counts.

And now you have that thing.

All right?

I'm not even going to face the back, because it's just, ugh.

Why am I freaking out in this way?

I am freaking out in this way, because I

am looking at this thing that is supposed to represent

15 movies, and comparing to this thing that is supposed

to be representing 17 movies--

and it is just wrong.

This is the reason we don't plot the counts-- the bars are

uneven.

And what your eye is picking up as a big bar,

is a bar with high area.

If you forget that the widths are all different,

and just plot the counts, you will get weirdnesses like this.

So that does not represent, visually, 15 and 17, at all.
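
A quick calculation shows how badly the count version misleads. This sketch uses the two bins quoted in the lecture (5 to 10 with 17 movies, 65 to 100 with 15 movies) and computes the area each bar would have if the count itself were used as the height.

```python
# (left, right, count) for the two bins compared in the lecture.
bins = [(5, 10, 17), (65, 100, 15)]

# Area of each bar when the raw count is drawn as the height.
drawn_areas = [(right - left) * count for left, right, count in bins]

for (left, right, count), area in zip(bins, drawn_areas):
    print(f"[{left}, {right}): count={count}, drawn area={area}")
```

The counts are nearly equal, but the drawn areas are 85 and 525, so the wide bar looks about six times as large.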

And so just to make myself feel better, yes, thank you.

This thing is called a histogram.

And the area of each bar is the percent

of individuals in the bar.

Yes?

STUDENT: [INAUDIBLE]

PROFESSOR: So why do we have uneven bars?

That is a good question.

It was my choice.

Usually I use uneven bars when I am less interested

in detail in some regions, and more interested

in some other regions.

So if, for example, I was looking at incomes,

and I was interested in low to middle income,

I would probably have a lot of bars there.

And then beyond that, I didn't care so much,

then I would have wide bars.

All right, this figure is called a histogram.

We know that its horizontal axis is the number line--

it is drawn to scale.

And so you can't have arbitrary gaps in between the intervals,

either.

And the areas represent the percents-- let me

just quickly-- so there's a question of, then,

how did it figure out the heights?

We'll talk about that for the rest of the lecture.

OK, so the axes--

the total area sums to 100%.

I haven't left anybody out, it's everybody.

And we are using percents--

the area of each bar is a percent.

The area represents the percent, that's the area principle.

Horizontal axis is the number line.

The vertical axis-- now what's the vertical axis?

It's a rate.

Let's take a look.

Actually, before I take a look, questions thus far.

Yeah?

STUDENT: [INAUDIBLE]

PROFESSOR: What's the default for bins if we don't set it?

That?

It chooses 10 bins of equal size.

I think it goes min to max, and just divide by 10.
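
Taking that description at face value (ten equal bins from the minimum to the maximum, which the answer above itself only offers tentatively), the edges could be computed as in this sketch. `default_edges` is a hypothetical helper, not a library function, and the sample ages are made up apart from the youngest (1) and oldest (97) mentioned earlier.

```python
def default_edges(values, n_bins=10):
    """Edges of n_bins equal-width bins spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]

edges = default_edges([1, 40, 97])
print(len(edges))
print(edges[0], edges[-1])
```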

All right, so-- you've got a bar, it's a rectangle.

Yes?

So area is width times height.

You agree?

All right, so for all these bars, the width

we know-- let me actually redraw the histogram.

I know the widths because I set them.

I know the areas, because that was counted by Bin.

I need to figure out the height, and you'll agree that--

yes?

Area is equal to height times width-- so how

do you figure out the height?

You take the area, and you divide by the width.

That's what's happened.

So let's take a look at the distribution again.

So I'm just binning, I'm not doing a histogram.

Bins equals, I was calling them My Bins.

OK, there's the count.

What I'd like to do is I would like to look--

we've got to pick some bar to look at.

Let's look at this one.

The 40 to 65 bar--

how many movies in that bar?

52-- you agree?

Total movies was 200.

So let's start doing in 40, 40 to 65--

no, I don't want a hard bracket there, but bin.

There are 52 out of 200 movies.

So percent is 52 divided by 200.

And I actually want a percent--

so how about we multiply by 100.

Looking good?

And so percent is 26%, which you have all done in your head.

The width of the interval is right point minus left point.

So 65 minus 40.

So percent area divided by width should give me 26 over 24,

and that is--

no, sorry, 26 over 25--

and that is 1.04.

Should we look at that bar?

This is this bar.

The 40 to 65 bar is right here.

You see it's just a smidgen over 1, the height, that's 1.04.
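
The same steps, written as a plain-Python sketch with the numbers from the lecture (52 movies out of 200 in the 40 to 65 bin). The variable names are only illustrative.

```python
count, total = 52, 200
left, right = 40, 65

percent = count / total * 100   # area of the bar, in percent
width = right - left            # width of the bin, in years
height = percent / width        # height of the bar, in percent per year

print(percent, width, height)
```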

OK, so I will stop now, and I will take questions

about the drawing of the height--

about the calculation of the height-- and then

we're going to interpret it.

OK.

Somewhere in your notes, or your phone, or whatever--

wherever you are taking a record of this--

please put a note, remember how to calculate histogram heights.

Because the temptation is to put the count, or the percent

there, and it ain't so-- you have to divide by the width.

And then, of course, we have to look at units,

which we will do right now.

You can see that this is--

wait a minute, what happened here?

OK, the calculation that we did is 26% divided by 25 years.

Which is why the unit is percent per year.

For each year, there is 26% of the data in that bar--

sorry, for each year there is 1.04% of the data.

So that's why percent per year here, and that just simply

tells you why those words are written there.

We will interpret what percent per year

means as a physical thing in just a moment.

So to find the height of the second bar,

I think it had 18 movies in it.

Maybe it was 17, let me see.

Where are my bins?

OK, the 10 to 15 bar had 18 movies in it.

That's how many percent of 200?

9% of 200.

What's the width of the bar?

Five units, yes?

10 to 15.

So 9 divided by 5, the height of the bar should be 1.8,

and that's what you're seeing.

That's the 1.8 right there.

So 1.8% per year.

So that's how you calculate the heights.

That is a summary of the calculation that we just did.

OK, I've just written out what we did in the notebook.
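
As a small sketch of that summary, the same calculation can be wrapped in a function and applied to the second bar (18 movies in the 10 to 15 bin). The function name `histogram_height` is hypothetical; it is not part of any library used in the course.

```python
def histogram_height(count, total, left, right):
    """Height of a histogram bar: percent of the data in the bin, divided by bin width."""
    percent = 100 * count / total
    return percent / (right - left)

# Second bar from the lecture: 18 of 200 movies, bin from 10 to 15.
second_bar = histogram_height(count=18, total=200, left=10, right=15)
print(second_bar)  # 1.8
```

Dividing by the width is the step that is easy to forget; without it the function would return the percent (9), not the height.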

OK, so now what does height measure?

Percent per year-- percent is the area,

the amount of stuff in the bar.

Year is what's happening along the horizontal axis.

What you need to do now is to think of your bar

and look at the bin, and look at individual years

within the bin.

That height of 1.04 is telling you

that in each individual year, there is, on average,

1.04% of the data--

1.04% per year in that bar.

So the amount of movies in the bin, relative to the size

of the bin.

And so therefore, it is called a density.

It is not measuring how many movies there are,

but rather how crowded the movies are in the bin.

Let's just go back and take a look at--

let's see.

OK, 40 to 65 bin has 52 movies in it.

The 25 to 40 bin-- the one to the left-- has 40 movies in it.

52 movies, 40 movies-- you notice, fewer movies,

but taller.

Yeah, that is your signal that the height is not

counting how many movies.

It's counting something else.

It's counting how many movies per unit space here.

And so while this bin does have more movies,

it has correspondingly even more space.

So those movies have more elbow room--

you just imagine lining them up here,

they just have more elbow room.

So they're less crowded.

And so the height is lower.
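
The comparison between the two bins can be checked numerically. The counts and bin edges are the ones quoted in the lecture; the dictionary layout and names are just one way to organize them.

```python
# Compare how crowded two bins are: more movies does not mean a taller bar.
total_movies = 200
bins = {
    "25 to 40": {"count": 40, "width": 40 - 25},
    "40 to 65": {"count": 52, "width": 65 - 40},
}

for label, b in bins.items():
    percent = 100 * b["count"] / total_movies
    b["height"] = percent / b["width"]  # percent per year
    print(f"{label}: {b['count']} movies, {percent:.0f}% of the data, "
          f"height {b['height']:.2f}% per year")
```

The 25 to 40 bin holds fewer movies but squeezes them into 15 years instead of 25, so its height comes out larger.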

OK, area measures percent-- we've said that over and over again.

If you want to discuss how many individuals there are in a bin,

you're going to look at area.

If you want to discuss how crowded is the bin,

where is the most action per unit length

on the horizontal axis?

Then you are going to look at height.
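
Going the other way, from height back to area, is a quick check on the area principle. This sketch uses the 40 to 65 bar from the lecture; the variable names are illustrative.

```python
# Recover the percent (area) and the count from a bar's height and width.
total_movies = 200
height = 1.04        # percent per year, read off the vertical axis
left, right = 40, 65

area_percent = height * (right - left)             # area = height * width
movies_in_bin = area_percent / 100 * total_movies  # percent back to a count

print(round(area_percent, 2), round(movies_in_bin))
```

The rounding only guards against floating-point noise; the area comes back as 26 percent, which corresponds to the 52 movies counted earlier.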

I want to leave you with a comparison.

Which do you draw?

You've got a distribution, you're trying

to draw a representation of that distribution-- which

do you draw?

A bar chart or a histogram?

Bar chart-- categorical variable.

The bars can have arbitrary widths,

because nothing is numerical there.

You get to choose.

You can have spacings, you get to choose.

And the height or the length, if you are doing it horizontally,

is proportional to how many you have in that category.

Histogram, numerical variable-- because numerical variable,

the horizontal axis must be to scale.

No gaps.

And you are allowed to have unequal bins--

that's the key.

You follow the area principle, and calculate the heights

accordingly.

All OK?

Have a great weekend.

I'll see you Monday.

For more information >> STAT C8 - 2018-02-02 - Duration: 55:00.

-------------------------------------------

Bhai ne aapne hi bahan Ko choda naga karke video viral - Duration: 10:53.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 13 - Duration: 11:04.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 3 - Duration: 6:36.


-------------------------------------------

trial video - Duration: 4:05.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 11 - Duration: 10:12.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Third Reading - Video 5 - Duration: 7:03.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 4 - Duration: 7:54.


-------------------------------------------

💖💖New WhatsApp Status Video 2018 💖💖|Pehla Pehla Pyaar song - Duration: 0:31.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 2 - Duration: 11:32.


-------------------------------------------

Bhai Bahan Ki Chudai Video - Duration: 6:01.


-------------------------------------------

Education (Tertiary Education and Other Matters) Amendment Bill - Second Reading - Video 12 - Duration: 2:38.


-------------------------------------------

"Bill becomes Law" Official Music Video - Duration: 2:02.

Simon: Uh

Simon: Yeah

Simon: Skrrt Skrrt

Simon: Uh

Nick: Bill becomes Law, may be a process with tons of meetings and documents but in actuality it's not.

Simon: not

Nick: And Imma tell you why, teach you the ins-and-outs with nothing being left out, ain't nothing here flaw

Simon: Yeah

Nick: The bill is introduced, gets a number and a title, for ID to be checked out, it can't be excused

Nick: Then it goes to the Committee, it's reviewed and voted on, and if they table it, then the poor Bill's gone

Simon: Gone

If everything's good then it's sent to the House and the Senate for a debate

Simon: Ayy

Some things they change, if one or the other defeats the Bill then it is just dead

Nick: But most of the time, both sides approve with a few changes and it goes to Congress

Simon: Government

Nick: Now it's the time, Senator J-Dawg is gonna finish with his line!

Justin: If the Senate and the House agree, it goes to the President

Simon: President

Justin: The Big Boss signs or doesn't, if it's irrelevant

Simon: Ayy

Justin: If he likes it, there's a new law

Justin: If he doesn't, it goes to the floor

Justin: But Congress can come with a back hand, right to his face and make it a law

Justin: It's all about the vote, two-thirds of the House can make it go

Justin: That's how The Machine works, from a Bill to a Law, it all comes down to the vote
