2.24

2025-02-24

Kernel Density Estimation

Learned about kernel density estimation (KDE) from a Zhihu answer: it lets you estimate a probability density function from samples.
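The estimator is simple enough to write by hand: the estimate at a point is the average of one kernel bump per sample. A minimal sketch with a Gaussian kernel, checked against `scipy.stats.gaussian_kde` (assuming scipy's 1-D bandwidth is `kde.factor` times the sample standard deviation with `ddof=1`, which matches its documented covariance scaling):

```python
import numpy as np
import scipy.stats as stats

def naive_kde(data, x, h):
    # One Gaussian bump of width h centered at each sample, averaged:
    # (len(x), len(data)) matrix of scaled distances via broadcasting.
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-u ** 2 / 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = rng.uniform(-10, 10, 1000)
x = np.linspace(-10, 10, 200)

kde = stats.gaussian_kde(data)
h = kde.factor * data.std(ddof=1)  # assumed equal to scipy's 1-D bandwidth
print(np.max(np.abs(naive_kde(data, x, h) - kde(x))))
```

With that bandwidth the two curves should agree up to floating-point noise.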

A simple Python program shows the effect of KDE (with a Gaussian kernel):

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

dataAmount = 10 ** 5
xAmount = 10 ** 4
data = np.random.uniform(-10, 10, dataAmount)
x = np.linspace(-10, 10, xAmount)
kde = stats.gaussian_kde(data)
density = kde(x)

plt.plot(x, density)
plt.show()

Output:

(Figure: KDE example)

In the ideal limit the middle of the plot would be flat (the uniform density), but both dataAmount and xAmount are finite, so the question is how to get the best result for a fixed amount of computation.

Computational cost

As for cost: from the KDE formula, the complexity should be proportional to dataAmount * xAmount, but a test shows otherwise:

import time

startTime = time.perf_counter()

def startCount():
    global startTime
    startTime = time.perf_counter()

def printTime(x: str):
    global startTime
    print(f"{x}: {time.perf_counter() - startTime}")
    startTime = time.perf_counter()
    
for i in range(5, 9):
    j = 9 - i
    dataAmount = 10 ** i
    xAmount = 10 ** j
    data = np.random.uniform(-10, 10, dataAmount)
    x = np.linspace(-10, 10, xAmount)
    startCount()
    kde = stats.gaussian_kde(data)
    kde(x)
    printTime("data amount = {}, xAmount = {}".format(dataAmount, xAmount))

Output:

data amount = 100000, xAmount = 10000: 5.868209599982947
data amount = 1000000, xAmount = 1000: 8.375833499943838
data amount = 10000000, xAmount = 100: 12.967822499922477
data amount = 100000000, xAmount = 10: 20.36547399999108

A further experiment:

import pandas as pd

results = np.full((10, 10), np.nan)
for i in range(1, 10):
    for j in range(1, 11 - i):
        dataAmount = 10 ** i
        xAmount = 10 ** j
        data = np.random.uniform(-10, 10, dataAmount)
        x = np.linspace(-10, 10, xAmount)
        startCount()
        kde = stats.gaussian_kde(data)
        kde(x)
        results[i][j] = time.perf_counter() - startTime

df = pd.DataFrame(results)
df.drop(columns=[0], index=[0], inplace=True)
print(df.to_string(na_rep=''))

Output (row index: log10 of dataAmount, column index: log10 of xAmount):

   1         2         3         4         5         6         7         8         9
1  0.032673  0.000223  0.00274   0.001742  0.006912  0.063041  0.632997  6.311374  78.66403
2  0.021071  0.000292  0.003947  0.006638  0.068121  0.605103  5.983759  60.18651
3  0.000999  0.001523  0.006448  0.060406  0.584952  5.902063  58.92799
4  0.003021  0.006319  0.057359  0.577309  5.777679  58.72213
5  0.008281  0.064728  0.584895  5.897568  59.1081
6  0.124268  0.879937  8.686681  84.32257
7  1.678993  12.47852  123.3277
8  19.62004  156.6667
9  361.3757

Evidently, the runtime actually grows faster with dataAmount than with xAmount.
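A naive NumPy evaluation of the same sum is symmetric in the two sizes, so the asymmetry presumably comes from scipy's implementation (its bandwidth and covariance estimation scale with dataAmount, and its evaluation strategy depends on which array is larger). A sketch that times the symmetric naive version (the fixed bandwidth 0.5 is an arbitrary choice for this test):

```python
import time
import numpy as np

def naive_kde(data, x, h):
    # O(len(data) * len(x)) evaluation via one broadcasted matrix.
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-u ** 2 / 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
for n, m in [(10 ** 4, 10 ** 2), (10 ** 2, 10 ** 4)]:
    data = rng.uniform(-10, 10, n)
    x = np.linspace(-10, 10, m)
    t0 = time.perf_counter()
    naive_kde(data, x, 0.5)
    print(f"dataAmount = {n}, xAmount = {m}: {time.perf_counter() - t0}")
```

Both configurations do the same 10^6 kernel evaluations, so their times should be close.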

Visual quality

def generate_kde(i, j):
    dataAmount = 10 ** i
    xAmount = 10 ** j
    data = np.random.uniform(-10, 10, dataAmount)
    x = np.linspace(-10, 10, xAmount)
    kde = stats.gaussian_kde(data)
    return (x, kde(x))

  

rows, cols = 8, 8
fig, axes = plt.subplots(rows, cols, figsize=(11, 11))

for i in range(rows):
    for j in range(cols):
        if i + j >= 8:
            axes[i, j].axis('off')
        else:
            ax = axes[i, j]
            ax.set_xticks([])
            ax.set_yticks([])
            x, kde = generate_kde(i + 1, j + 1)
            ax.plot(x, kde)
            ax.set_title(f'({i + 1},{j + 1})')

plt.show()

(Figure: an 8×8 grid of KDE plots; panel (i, j) uses dataAmount = 10^i, xAmount = 10^j)

As the figure shows, dataAmount matters far more than xAmount!
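The visual impression can be quantified: the true density here is the uniform density 1/20 on [-10, 10], so we can measure the estimate's mean absolute error for different sizes. A rough sketch (evaluating only on [-8, 8] to stay away from KDE's boundary bias; `kde_mae` is a helper name introduced here):

```python
import numpy as np
import scipy.stats as stats

def kde_mae(dataAmount, xAmount, rng):
    # Mean absolute error of the KDE against the true uniform density 1/20,
    # measured on the interior of the interval.
    data = rng.uniform(-10, 10, dataAmount)
    x = np.linspace(-8, 8, xAmount)
    density = stats.gaussian_kde(data)(x)
    return np.abs(density - 1 / 20).mean()

rng = np.random.default_rng(0)
for n in (10 ** 2, 10 ** 3, 10 ** 4):
    print(n, kde_mae(n, 10 ** 3, rng))
```

The error should shrink as dataAmount grows, while increasing xAmount alone only samples the same curve more finely.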

Library

I skipped three classes today and felt a bit guilty, so I came to the library at 8 p.m. to study. The semester has barely started, yet the library is already packed (though not as badly as finals week).

(Photo: the library, 2-24)

Added a static table of contents to every diary entry (日寄) with the Table of Contents plugin, then happily got on with probability theory!

The Fourier transform of $f(x) = \mathrm{e}^{-\beta x}$ for $x \ge 0$ (and $0$ otherwise)

$$\begin{aligned}\mathcal{F}f(s) &= \int_{-\infty}^{+\infty}{f(x)\mathrm{e}^{-2\pi isx}dx}\\ &= \int_0^{+\infty}{\mathrm{e}^{-\beta x}\mathrm{e}^{-2\pi isx}dx}\\ &= \int_0^{+\infty}{\mathrm{e}^{-(\beta+2\pi is)x}dx}\\ &= \left[-\frac{\mathrm{e}^{-(\beta+2\pi is)x}}{\beta+2\pi is}\right]_0^{+\infty}\\ &= \frac{1}{\beta+2\pi is} \end{aligned}$$

Now, let $g(x) = \mathrm{e}^{-\beta|x|}$, then

$$\mathcal{F}g(s) = \frac{1}{\beta+2\pi is} + \frac{1}{\beta-2\pi is} = \frac{2\beta}{\beta^2+4\pi^2s^2}$$

So, writing $\beta = 2\pi\gamma$, the Cauchy density $\frac{1}{\pi}\frac{\gamma}{\gamma^2+x^2}$ has Fourier transform $\mathrm{e}^{-2\pi\gamma|s|}$.

This makes Exercise 15.10.22 of 《普林斯顿概率论读本》 (*The Probability Lifesaver*) easy: show that the Cauchy distribution is stable.
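In outline (a sketch of the standard characteristic-function argument, not necessarily the book's intended solution): the Fourier transform turns the convolution of two densities, i.e. the density of a sum of independent variables, into a product, and the family $\mathrm{e}^{-2\pi\gamma|s|}$ is closed under products:

```latex
\begin{aligned}
\mathcal{F}(f_{X_1} * f_{X_2})(s)
  &= \mathcal{F}f_{X_1}(s)\,\mathcal{F}f_{X_2}(s) \\
  &= \mathrm{e}^{-2\pi\gamma_1|s|}\,\mathrm{e}^{-2\pi\gamma_2|s|}
   = \mathrm{e}^{-2\pi(\gamma_1+\gamma_2)|s|}
\end{aligned}
```

So the sum of independent Cauchy($\gamma_1$) and Cauchy($\gamma_2$) variables is again Cauchy, with parameter $\gamma_1+\gamma_2$, which is exactly stability.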

Then the library closed. The sound of Dust 2 gunfire has lately returned to my dorm, so I went to a classroom building to keep studying, when suddenly a cat "senior" (猫学长) walked in and sat on the desk.

(Photos: the cat turning its head, looking straight ahead, lying down)
