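Correlation Tests (CorrelationTests)

The CorrelationTests class provides six methods for assessing association and independence between variables. These tests are useful in data exploration, feature selection, and dependence analysis.

class CorrelationTests

A class of statistical methods for assessing association between variables. All methods return a TestResult object.

Main methods:

pearson_correlation: Pearson product-moment correlation test
spearman_correlation: Spearman rank correlation test
kendall_tau: Kendall τ correlation test
chi_square_independence: chi-square test of independence
fisher_exact: Fisher's exact test
mcnemar_test: McNemar's test

Test Methods in Detail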

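Pearson Correlation Test

pearson_correlation(x: np.ndarray | List, y: np.ndarray | List, alpha: float = 0.05) → TestResult

Pearson product-moment correlation test. Assesses the linear correlation between two continuous variables.

Parameters:

x: data for the first variable
y: data for the second variable
alpha: significance level, default 0.05

Applicability:

Both variables are continuous
The data approximately follow a bivariate normal distribution
The relationship is assumed to be linear

Null hypothesis: there is no linear correlation between the two variables (ρ = 0)

Use cases:

Assessing linear relationships between continuous variables
Detecting redundancy during feature selection
Exploring variable relationships before regression analysis

Example:

>>> from pywayne.statistics import CorrelationTests
>>> import numpy as np
>>>
>>> ct = CorrelationTests()
>>> x = np.random.normal(0, 1, 100)
>>> y = 2 * x + np.random.normal(0, 0.5, 100)  # correlated with x
>>> result = ct.pearson_correlation(x, y)
>>> print(f"Correlation coefficient: {result.statistic:.4f}, p-value: {result.p_value:.4f}")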
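To make the null hypothesis ρ = 0 concrete, the sketch below computes the sample correlation r and the usual t statistic t = r·sqrt((n − 2) / (1 − r²)) by hand, then compares the result with scipy.stats.pearsonr. It uses SciPy rather than pywayne; whether CorrelationTests computes its p-value this way internally is an assumption.

```python
import numpy as np
from scipy import stats

# Simulated data in the spirit of the example above (the seed is arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)

n = len(x)
r = np.corrcoef(x, y)[0, 1]                      # sample Pearson correlation
t_stat = r * np.sqrt((n - 2) / (1 - r**2))       # t statistic under H0: rho = 0
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2) # two-sided p-value, n - 2 degrees of freedom

# Reference values from SciPy for comparison.
r_scipy, p_scipy = stats.pearsonr(x, y)

print(f"manual: r={r:.4f}, t={t_stat:.2f}, p={p_manual:.3g}")
print(f"scipy:  r={r_scipy:.4f}, p={p_scipy:.3g}")
```

The two p-values agree up to floating-point error because the t test and the exact distribution SciPy uses for r are equivalent under bivariate normality.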

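Spearman Rank Correlation Test

spearman_correlation(x: np.ndarray | List, y: np.ndarray | List, alpha: float = 0.05) → TestResult

Spearman rank correlation test. Assesses the monotonic relationship between two variables.

Parameters:

x: data for the first variable
y: data for the second variable
alpha: significance level, default 0.05

Applicability:

Data are at least ordinal
No normality assumption is required
Assesses monotonic relationships (not limited to linear ones)

Null hypothesis: there is no monotonic relationship between the two variables

Use cases:

Correlation analysis of non-normal data
Association analysis of ordinal variables
Detecting nonlinear monotonic relationships

Example:

>>> # A nonlinear but monotonic relationship
>>> x = np.random.uniform(0, 10, 100)
>>> y = x**2 + np.random.normal(0, 5, 100)  # nonlinear relationship
>>> result = ct.spearman_correlation(x, y)
>>> print(f"Spearman coefficient: {result.statistic:.4f}")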
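One way to see why Spearman's coefficient captures monotonic rather than linear association: it is the Pearson correlation of the ranks. The sketch below checks this with SciPy (not pywayne) on data similar to the example above; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = x**2 + rng.normal(0, 5, 100)  # monotonic in x, but not linear

# Spearman's rho is Pearson's r applied to the ranks of the data.
rho_from_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
rho_scipy, p_value = stats.spearmanr(x, y)

# Pearson's r on the raw values, for contrast.
r_pearson, _ = stats.pearsonr(x, y)

print(f"Spearman (via ranks): {rho_from_ranks:.4f}")
print(f"Spearman (scipy):     {rho_scipy:.4f}")
print(f"Pearson (raw values): {r_pearson:.4f}")
```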

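Kendall τ Correlation Test

kendall_tau(x: np.ndarray | List, y: np.ndarray | List, alpha: float = 0.05) → TestResult

Kendall τ correlation test. A rank correlation based on concordant and discordant pairs.

Parameters:

x: data for the first variable
y: data for the second variable
alpha: significance level, default 0.05

Applicability:

Ordinal or continuous data
Relatively robust to outliers
Performs well with small samples

Null hypothesis: the two variables are independent

Use cases:

Correlation analysis with small samples
Robust correlation in the presence of outliers
Association analysis of ordinal data

Example:

>>> # Data containing outliers
>>> x = np.concatenate([np.random.normal(0, 1, 95), [10, -10]])
>>> y = np.concatenate([np.random.normal(0, 1, 95), [12, -8]])
>>> result = ct.kendall_tau(x, y)
>>> print(f"Kendall τ: {result.statistic:.4f}")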
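The concordant/discordant pair idea can be illustrated by counting pairs directly. The sketch below uses a small made-up sample without ties, where τ = (C − D) / (n(n − 1)/2), and compares the result with scipy.stats.kendalltau. The data values are illustrative only, and SciPy stands in for pywayne here.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Small illustrative sample with no ties.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 1.9, 3.5, 3.0, 5.2, 6.8])

concordant = 0
discordant = 0
for i, j in combinations(range(len(x)), 2):
    # A pair is concordant when x and y move in the same direction.
    direction = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy, p_value = stats.kendalltau(x, y)
print(f"C={concordant}, D={discordant}, tau={tau_manual:.4f} (scipy: {tau_scipy:.4f})")
```

With ties present, the tie-corrected τ-b variant that SciPy computes by default would differ from this simple formula.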

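Chi-Square Test of Independence

chi_square_independence(data: np.ndarray | List[List], alpha: float = 0.05) → TestResult

Chi-square test of independence. Tests whether categorical variables are independent.

Parameters:

data: contingency table (two-dimensional array)
alpha: significance level, default 0.05

Applicability:

Categorical variables
Expected frequency ≥ 5 in at least 80% of cells
Sufficiently large sample

Null hypothesis: the two categorical variables are independent

Use cases:

Association analysis of categorical variables
Preference associations in market research
Risk factor analysis in medical research

Example:

>>> # 2x2 contingency table: gender vs. product preference
>>> contingency_table = [[30, 20], [15, 35]]  # rows: gender, columns: preference
>>> result = ct.chi_square_independence(contingency_table)
>>> print(f"Chi-square statistic: {result.statistic:.4f}, independence retained: {not result.reject_null}")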
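The expected-frequency condition can be checked before running the test. The sketch below computes the expected counts under independence (row total × column total / grand total) for the table above and cross-checks them against scipy.stats.chi2_contingency. It uses SciPy rather than pywayne, and passes correction=False to obtain the plain Pearson statistic; whether chi_square_independence applies a continuity correction is not stated in this documentation.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 20], [15, 35]])

# Expected counts under independence: outer product of the margins over the total.
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / observed.sum()

# Share of cells meeting the "expected frequency >= 5" rule of thumb.
share_ok = np.mean(expected >= 5)

chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print("expected counts:\n", expected)
print(f"cells with expected >= 5: {share_ok:.0%}")
print(f"chi2={chi2:.4f}, dof={dof}, p={p_value:.4f}")
```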

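Fisher's Exact Test

fisher_exact(data: np.ndarray | List[List], alternative: str = 'two-sided', alpha: float = 0.05) → TestResult

Fisher's exact test. An exact test of independence for 2×2 contingency tables.

Parameters:

data: 2×2 contingency table
alternative: type of alternative hypothesis ('two-sided', 'less', 'greater')
alpha: significance level, default 0.05

Applicability:

2×2 contingency table
Small samples, or expected frequencies that are too small
Exact test method

Null hypothesis: the two binary variables are independent

Use cases:

Analysis of 2×2 tables with small samples
Binary outcomes in clinical trials
Cases where expected frequencies do not meet the chi-square test conditions

Example:

>>> # A small-sample 2x2 table
>>> table_2x2 = [[8, 2], [1, 5]]
>>> result = ct.fisher_exact(table_2x2)
>>> print(f"Fisher's exact test p-value: {result.p_value:.4f}")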
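The test is "exact" because, with the table margins fixed, the top-left cell follows a hypergeometric distribution under independence. The sketch below reproduces the one-sided ('greater') p-value for the table above from that distribution and compares it with scipy.stats.fisher_exact. SciPy is used in place of pywayne, so this illustrates the method rather than the library's implementation.

```python
from scipy import stats

table = [[8, 2], [1, 5]]
(a, b), (c, d) = table
total = a + b + c + d

# P(top-left cell >= a) with margins fixed:
# population of size `total`, a + b "successes", a + c draws.
p_greater_manual = stats.hypergeom.sf(a - 1, total, a + b, a + c)

odds_ratio, p_greater = stats.fisher_exact(table, alternative="greater")
_, p_two_sided = stats.fisher_exact(table, alternative="two-sided")

print(f"odds ratio: {odds_ratio:.1f}")
print(f"one-sided p (manual): {p_greater_manual:.4f}")
print(f"one-sided p (scipy):  {p_greater:.4f}")
print(f"two-sided p (scipy):  {p_two_sided:.4f}")
```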

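McNemar's Test

mcnemar_test(data: np.ndarray | List[List], correction: bool = True, alpha: float = 0.05) → TestResult

McNemar's test. Compares marginal frequencies for paired binary data.

Parameters:

data: 2×2 contingency table (paired data)
correction: whether to apply the continuity correction, default True
alpha: significance level, default 0.05

Applicability:

Paired binary data
Before-and-after measurements on the same subjects
Interest lies in changes in marginal probabilities

Null hypothesis: the marginal probabilities are equal

Use cases:

Comparing binary outcomes before and after treatment
Analyzing attitude changes within the same group
Evaluating effectiveness in paired designs

Example:

>>> # Paired success/failure data before and after treatment
>>> # [success before & success after, success before & failure after]
>>> # [failure before & success after, failure before & failure after]
>>> paired_table = [[25, 5], [15, 10]]
>>> result = ct.mcnemar_test(paired_table)
>>> print(f"McNemar's test p-value: {result.p_value:.4f}")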
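Only the two discordant cells (b: success then failure, c: failure then success) carry information about a change in marginal probabilities. The sketch below computes the textbook statistic with and without the continuity correction, plus an exact binomial p-value, using SciPy distributions. It assumes the standard formulas (|b − c| − 1)² / (b + c) and (b − c)² / (b + c); that mcnemar_test uses exactly these is an assumption.

```python
from scipy import stats

paired_table = [[25, 5], [15, 10]]
b = paired_table[0][1]  # success before, failure after
c = paired_table[1][0]  # failure before, success after

# Chi-square approximation with continuity correction (1 degree of freedom).
stat_corrected = (abs(b - c) - 1) ** 2 / (b + c)
p_corrected = stats.chi2.sf(stat_corrected, df=1)

# Same statistic without the correction.
stat_uncorrected = (b - c) ** 2 / (b + c)
p_uncorrected = stats.chi2.sf(stat_uncorrected, df=1)

# Exact two-sided p-value: under H0 the b discordant pairs are Binomial(b + c, 0.5).
p_exact = min(1.0, 2 * stats.binom.cdf(min(b, c), b + c, 0.5))

print(f"corrected:   stat={stat_corrected:.2f}, p={p_corrected:.4f}")
print(f"uncorrected: stat={stat_uncorrected:.2f}, p={p_uncorrected:.4f}")
print(f"exact binomial p={p_exact:.4f}")
```

The exact version is usually preferred when b + c is small, since the chi-square approximation is then unreliable.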

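Usage Recommendations

Method Selection Guide

Data type:
- Continuous data: Pearson correlation (normally distributed), Spearman correlation (non-normal)
- Ordinal data: Spearman correlation, Kendall τ
- Categorical data: chi-square test, Fisher's exact test
Distributional assumptions:
- Normality holds: Pearson correlation
- Normality does not hold: Spearman correlation, Kendall τ
- No distributional assumption: nonparametric methods
Sample size:
- Large samples: all methods apply
- Small samples: Kendall τ, Fisher's exact test
- Small expected frequencies: Fisher's exact test
Study design:
- Independent samples: Pearson, Spearman, chi-square
- Paired samples: McNemar's test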

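Data Preparation Recommendations

Data cleaning:
- Handle missing values and outliers
- Confirm that data types are correct
- Check the distributional properties of the data
Variable transformation:
- Standardize when necessary
- Encode categorical variables appropriately
- Consider nonlinear transformations
Sample size assessment:
- Compute the required sample size
- Evaluate the power of the test
- Consider the effect size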
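As a rough illustration of the sample size assessment step, the sketch below estimates how many observations are needed to detect a given Pearson correlation, using the common Fisher z approximation n ≈ ((z_{1−α/2} + z_{power}) / atanh(r))² + 3. The function name required_n_for_correlation is hypothetical and not part of pywayne, and the approximation assumes bivariate normal data and a two-sided test.

```python
import math

from scipy import stats


def required_n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect correlation r (hypothetical helper)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = stats.norm.ppf(power)          # quantile for the desired power
    effect = math.atanh(abs(r))              # Fisher z transform of the effect size
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


for r in (0.1, 0.3, 0.5):
    print(f"r={r}: n ≈ {required_n_for_correlation(r)}")
```

Smaller correlations need far larger samples, which is why a non-significant result on a small sample says little about weak associations.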

Typical Application Examples

Exploratory Data Analysis

from pywayne.statistics import CorrelationTests
import numpy as np

# Simulate a multivariate dataset
np.random.seed(42)
n = 100

# Generate correlated variables
x1 = np.random.normal(0, 1, n)
x2 = 0.8 * x1 + np.random.normal(0, 0.6, n)  # strongly correlated with x1
x3 = np.random.uniform(0, 10, n)
x4 = x3**0.5 + np.random.normal(0, 0.5, n)   # nonlinearly related to x3

ct = CorrelationTests()

# Run pairwise tests over all variable pairs
variables = {'x1': x1, 'x2': x2, 'x3': x3, 'x4': x4}
var_names = list(variables.keys())

print("Correlation analysis results:")
print("=" * 50)

for i in range(len(var_names)):
    for j in range(i+1, len(var_names)):
        var1, var2 = var_names[i], var_names[j]

        # Pearson correlation
        pearson_result = ct.pearson_correlation(variables[var1], variables[var2])

        # Spearman correlation
        spearman_result = ct.spearman_correlation(variables[var1], variables[var2])

        print(f"\n{var1} vs {var2}:")
        print(f"  Pearson:  r={pearson_result.statistic:.3f}, p={pearson_result.p_value:.3f}")
        print(f"  Spearman: ρ={spearman_result.statistic:.3f}, p={spearman_result.p_value:.3f}")

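Association Analysis of Categorical Variables

from pywayne.statistics import CorrelationTests
import numpy as np

# Simulated categorical data: education level vs. income level
# Education level: below high school, high school, university, graduate
# Income level: low, medium, high

# Contingency table (rows: education level, columns: income level)
education_income_table = [
    [40, 35, 15],   # below high school
    [25, 45, 30],   # high school
    [15, 40, 45],   # university
    [5,  20, 25]    # graduate
]

ct = CorrelationTests()

# Chi-square test of independence
chi2_result = ct.chi_square_independence(education_income_table)

print("Association between education level and income level:")
print(f"Chi-square statistic: {chi2_result.statistic:.4f}")
print(f"p-value: {chi2_result.p_value:.4f}")
print(f"Conclusion: {'education and income are significantly associated' if chi2_result.reject_null else 'no significant association between education and income'}")

# Compute Cramér's V (strength of association)
n = np.sum(education_income_table)
chi2_stat = chi2_result.statistic
min_dim = min(len(education_income_table) - 1, len(education_income_table[0]) - 1)
cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(f"Cramér's V: {cramers_v:.3f} (strength of association)")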

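Small-Sample Exact Test

from pywayne.statistics import CorrelationTests

# Small-sample clinical trial data: treatment vs. outcome
# 2x2 table: rows [new drug, control] x columns [effective, ineffective]

treatment_result = [
    [12, 3],   # new drug: 12 effective, 3 ineffective
    [6, 9]     # control: 6 effective, 9 ineffective
]

ct = CorrelationTests()

# Fisher's exact test (two-sided by default)
fisher_result = ct.fisher_exact(treatment_result)

# Chi-square test (for comparison)
chi2_result = ct.chi_square_independence(treatment_result)

print("Small-sample clinical trial analysis:")
print(f"Fisher's exact test p-value: {fisher_result.p_value:.4f}")
print(f"Chi-square test p-value: {chi2_result.p_value:.4f}")
# A two-sided test shows a difference, not its direction
print(f"New drug efficacy: {'differs significantly from control' if fisher_result.reject_null else 'no significant difference from control'}")

# Effect size: odds ratio
a, b, c, d = treatment_result[0][0], treatment_result[0][1], treatment_result[1][0], treatment_result[1][1]
odds_ratio = (a * d) / (b * c)
print(f"Odds ratio: {odds_ratio:.2f}")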

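Change Analysis for Paired Data

from pywayne.statistics import CorrelationTests
import numpy as np

# Improvement status before and after treatment (binary)
# McNemar's test example

# Generate paired data: status before treatment vs. status after treatment
np.random.seed(42)
n_patients = 100

# Simulate improved / not improved status before and after treatment
# Assume the treatment has some effect
pre_improved = np.random.binomial(1, 0.3, n_patients)  # 30% improved before treatment

# Probability of improvement increases after treatment
post_improved = np.zeros(n_patients, dtype=int)
for i in range(n_patients):
    if pre_improved[i] == 1:
        post_improved[i] = np.random.binomial(1, 0.9)  # 90% of improved patients stay improved
    else:
        post_improved[i] = np.random.binomial(1, 0.5)  # 50% of unimproved patients improve

# Build the McNemar table
mcnemar_table = [
    [np.sum((pre_improved == 1) & (post_improved == 1)),   # improved before and after
     np.sum((pre_improved == 1) & (post_improved == 0))],  # improved before, relapsed after
    [np.sum((pre_improved == 0) & (post_improved == 1)),   # not improved before, improved after
     np.sum((pre_improved == 0) & (post_improved == 0))]   # not improved before or after
]

ct = CorrelationTests()
mcnemar_result = ct.mcnemar_test(mcnemar_table)

print("Improvement before and after treatment (McNemar's test):")
print("Paired table:")
print(f"  Improved before and after: {mcnemar_table[0][0]}")
print(f"  Improved before, relapsed after: {mcnemar_table[0][1]}")
print(f"  Not improved before, improved after: {mcnemar_table[1][0]}")
print(f"  Not improved before or after: {mcnemar_table[1][1]}")
print(f"\nMcNemar statistic: {mcnemar_result.statistic:.4f}")
print(f"p-value: {mcnemar_result.p_value:.4f}")
print(f"Conclusion: {'the proportion improved changed significantly after treatment' if mcnemar_result.reject_null else 'no significant change in the proportion improved'}")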

Feature Selection

# Correlation analysis for feature selection
from pywayne.statistics import CorrelationTests
from sklearn.datasets import make_regression

# Generate a regression dataset
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                      random_state=42, noise=0.1)

ct = CorrelationTests()

print("Correlation between features and the target variable:")
print("=" * 40)

correlations = []
for i in range(X.shape[1]):
    pearson_result = ct.pearson_correlation(X[:, i], y)
    correlations.append({
        'feature': f'Feature_{i}',
        'correlation': pearson_result.statistic,
        'p_value': pearson_result.p_value,
        'significant': pearson_result.reject_null
    })

# Sort by absolute correlation strength
correlations.sort(key=lambda x: abs(x['correlation']), reverse=True)

for corr in correlations:
    print(f"{corr['feature']}: r={corr['correlation']:.3f}, "
          f"p={corr['p_value']:.3f}, significant={'yes' if corr['significant'] else 'no'}")

# Select the significantly correlated features
significant_features = [corr['feature'] for corr in correlations if corr['significant']]
print(f"\nSignificantly correlated features: {significant_features}")

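Notes

Correlation vs. causation:
- Correlation does not imply causation
- Consider the influence of third variables
- Interpret results with domain knowledge
Multiple testing correction:
- Applies when many variable pairs are compared
- Use Bonferroni or FDR correction
- Control the overall error rate
Outliers:
- Pearson correlation is sensitive to outliers
- Consider robust methods
- Treat outliers when necessary
Sample size adequacy:
- Ensure sufficient statistical power
- Small correlation coefficients need large samples
- Run a power analysis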
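The multiple testing point can be made concrete with a minimal NumPy sketch of the Bonferroni and Benjamini-Hochberg (FDR) procedures. The helper names bonferroni and benjamini_hochberg and the p-values are illustrative and not part of pywayne; in practice the p-values would come from the p_value field of each TestResult.

```python
import numpy as np


def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= alpha * k / m."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # step-up thresholds
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


# Illustrative p-values from eight pairwise tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("Bonferroni rejects:        ", bonferroni(p_values).sum())
print("Benjamini-Hochberg rejects:", benjamini_hochberg(p_values).sum())
```

Bonferroni is the more conservative of the two; the FDR procedure tolerates a controlled share of false discoveries in exchange for more power.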
