Applied Statistical Methods (II)

Author: Ethan

Learning Objectives

  • Understand packages and libraries in R and how they work.

  • Learn basics of ggplot2.

  • Calculate and visualize probabilities from:

    • Binomial distribution.
    • Normal distribution.

Key Functions

  • ggplot()
  • dbinom()
  • pbinom()
  • geom_segment()
  • pnorm()
  • stat_function()
  • dnorm()
  • geom_area()
  • punif()

Introduction to ggplot2

  • ggplot2 is a powerful visualization package based on the Grammar of Graphics.

  • Three key components in ggplot():

    1. Data – the dataset.
    2. Aesthetics (aes) – variables mapped to axes, color, size, etc.
    3. Geometric object – the type of plot (geom_histogram, geom_point, etc.).

ggplot Basics Example

Dataset: penguins.csv, read into a data frame named penguins. The examples use its body_mass_g column.

  1. Start with data (on its own this draws an empty panel):

    ggplot(data = penguins)
    
  2. Add aesthetics (the x-axis appears, but nothing is drawn yet):

    ggplot(data = penguins, aes(x = body_mass_g))
    
  3. Add geometry (the histogram is drawn):

    ggplot(data = penguins, aes(x = body_mass_g)) +
      geom_histogram()
    

Mnemonic: Data → Aesthetics → Geometric object.
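
Putting the three steps together, a complete script might look like the sketch below. It assumes penguins.csv sits in the working directory and that body_mass_g is measured in grams. The helper name plot_body_mass, the bins = 30 choice, and the label text are illustrative and not part of the original notes.

```r
library(ggplot2)

# Hypothetical helper: read the CSV and build the histogram in one place.
plot_body_mass <- function(path = "penguins.csv", bins = 30) {
  penguins <- read.csv(path)                       # 1. data
  ggplot(data = penguins, aes(x = body_mass_g)) +  # 2. aesthetics
    geom_histogram(bins = bins, na.rm = TRUE) +    # 3. geometric object
    labs(title = "Distribution of penguin body mass",
         x = "Body mass (g)",
         y = "Count")
}

# Usage (assumes the file exists): print(plot_body_mass("penguins.csv"))
```

Setting bins explicitly avoids the message ggplot2 prints when geom_histogram() falls back to its default of 30 bins, and na.rm = TRUE silences the warning about rows with missing body mass.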


Binomial Distribution in R

Probability Mass Function

  • dbinom(x, size, prob) → P(X = x)

    • x: number of successes.
    • size: number of trials.
    • prob: probability of success.

Example: probability of exactly 3 heads in 4 flips of a fair coin (p = 0.5).

dbinom(x = 3, size = 4, prob = 0.5)
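
As a check on what dbinom() computes, the same probability can be written out from the binomial formula C(n, k) · p^k · (1 − p)^(n − k). The variable names below are illustrative.

```r
n <- 4    # number of flips
k <- 3    # number of heads
p <- 0.5  # probability of heads on each flip

# Binomial PMF written out: C(n, k) * p^k * (1 - p)^(n - k)
manual  <- choose(n, k) * p^k * (1 - p)^(n - k)
builtin <- dbinom(x = k, size = n, prob = p)

print(manual)   # 0.25
print(builtin)  # 0.25
```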

Cumulative Distribution

  • pbinom(q, size, prob, lower.tail) → cumulative probability up to q.

  • Default: lower.tail = TRUE.

    • TRUE → P(X ≤ q).
    • FALSE → P(X > q), which excludes q itself.

Example: probability of at most 1 head in 4 flips, P(X ≤ 1).

pbinom(q = 1, size = 4, prob = 0.5)
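
A short sketch of how the cumulative value relates to the point probabilities, and of the off-by-one that lower.tail = FALSE introduces for "at least" questions. The "at least 2 heads" case is an added illustration, not an example from the notes.

```r
size <- 4
prob <- 0.5

# P(X <= 1) as a sum of point probabilities: P(X = 0) + P(X = 1)
at_most_1 <- sum(dbinom(0:1, size = size, prob = prob))

# P(X >= 2): lower.tail = FALSE gives P(X > q), so pass q = 2 - 1 = 1
at_least_2 <- pbinom(q = 1, size = size, prob = prob, lower.tail = FALSE)

print(at_most_1)               # 0.3125
print(at_least_2)              # 0.6875
print(at_most_1 + at_least_2)  # 1
```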

Binomial Visualization

  • Create a data frame with outcomes and probabilities:

    df1 <- data.frame(x = 0:4,
                      y = dbinom(0:4, size = 4, prob = 0.5))
    
  • Plot with geom_segment:

    ggplot(df1, aes(x = x, xend = x, y = 0, yend = y)) +
      geom_segment() +
      labs(title = "Binomial(4, 0.5)",
           x = "Number of Heads",
           y = "Probability")
    

Reminder: always add an informative title and axis labels.


Normal Distribution in R

Cumulative Probabilities

  • Function: pnorm(q, mean, sd, lower.tail) → P(X ≤ q) by default (lower.tail = TRUE).

  • Example: P(X < 50) when mean = 80, sd = 15. Because the normal distribution is continuous, P(X < 50) = P(X ≤ 50).

    pnorm(q = 50, mean = 80, sd = 15)
    
  • For the standard normal distribution (Z), mean = 0 and sd = 1 are the defaults, so pnorm(q) alone is enough.
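
The link between the general and the standard normal can be checked by standardizing by hand, z = (q − mean) / sd. The sketch below also shows an interval probability as a difference of two pnorm() values. The upper bound of 95 is an illustrative choice, not a value from the notes.

```r
mu    <- 80
sigma <- 15

# Direct call with mean and sd
p_direct <- pnorm(q = 50, mean = mu, sd = sigma)

# Same probability after standardizing: z = (50 - 80) / 15 = -2
z <- (50 - mu) / sigma
p_standard <- pnorm(z)

# P(50 < X < 95) as a difference of two cumulative probabilities
p_between <- pnorm(95, mean = mu, sd = sigma) - pnorm(50, mean = mu, sd = sigma)

print(p_direct)    # about 0.0228
print(p_standard)  # same value as p_direct
print(p_between)   # about 0.8186
```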


Visualizing Normal PDF

  • Use stat_function() with dnorm():

    ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
      stat_function(fun = dnorm, args = list(0, 1))
    
  • Improved with labels and formatting:

    ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
      stat_function(fun = dnorm, args = list(0, 1),
                    col = "black", lwd = 1) +
      labs(title = "Normal(0, 1)", x = "z", y = "Density")
    

Shading Areas in Normal Distribution

  • Use geom_area() to shade regions.

  • Example: Shade region where Z ≤ 1.4.

    ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
      # Shaded region: area under the density from -4 up to 1.4
      geom_area(stat = "function", fun = dnorm,
                args = list(0, 1),
                fill = "lightblue", xlim = c(-4, 1.4)) +
      # Density curve drawn on top of the shading
      stat_function(fun = dnorm, args = list(0, 1),
                    col = "black", lwd = 1) +
      labs(title = "Normal(0, 1)", x = "z", y = "Density")
    

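Key: the xlim argument of geom_area() sets the bounds of the shaded region. Here the lower bound -4 stands in for -∞, since the density below it is negligible.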
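
The shaded region corresponds to a number: the area under the density up to 1.4, which is what pnorm() returns. A minimal sketch comparing pnorm() with numerical integration of dnorm(), using only base R (variable names are illustrative):

```r
# Area under the standard normal density from -Inf to 1.4, two ways
area_cdf      <- pnorm(1.4)
area_integral <- integrate(dnorm, lower = -Inf, upper = 1.4)$value

# Area actually shaded in the plot, which starts at -4 rather than -Inf
area_shaded <- pnorm(1.4) - pnorm(-4)

print(area_cdf)       # about 0.9192
print(area_integral)  # agrees with pnorm(1.4) up to numerical error
print(area_shaded)    # smaller by about 0.00003
```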


Uniform Distribution in R

  • Function: punif(q, min, max, lower.tail).

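  • Example: a wait time uniformly distributed between 20 and 70 minutes; find P(X ≥ 30). With lower.tail = FALSE, punif() returns P(X > 30), which equals P(X ≥ 30) for a continuous distribution.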

    punif(q = 30, min = 20, max = 70, lower.tail = FALSE)
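
For a uniform distribution the probability of an interval is its length divided by the total length, so the result can be verified by hand. A small sketch with illustrative variable names:

```r
lo <- 20   # minimum wait (minutes)
hi <- 70   # maximum wait (minutes)

# Ratio of lengths: P(X >= 30) = (70 - 30) / (70 - 20)
manual  <- (hi - 30) / (hi - lo)
builtin <- punif(q = 30, min = lo, max = hi, lower.tail = FALSE)

print(manual)   # 0.8
print(builtin)  # 0.8
```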
    

General Reminders

  • Packages: install once with install.packages(), then load in every new session with library().

  • Binomial distribution: use dbinom() for exact probabilities, pbinom() for cumulative.

  • Normal distribution: use pnorm() for probabilities, dnorm() for density plotting.

  • Uniform distribution: use punif().

  • Visualizations should always include:

    • Title.
    • Clear axis labels.
    • Professional style suitable for reports.
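
For the package reminder at the top of this list, one common pattern is sketched below. It assumes an internet connection and a configured CRAN mirror the first time it runs. The requireNamespace() guard is a convenience added here, not something the notes prescribe.

```r
# Install once per machine; the check skips the download if ggplot2 is already present.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load at the top of every new R session or script.
library(ggplot2)
```

dbinom(), pbinom(), pnorm(), dnorm() and punif() come from the stats package, which R loads by default, so only the plotting code needs library(ggplot2).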