Time Series Forecasting with AutoGluon
Posted: 2023-08-22 | Category: Tech

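AutoGluon is a very convenient machine learning library.

It can often handle a complex machine learning problem in as few as three lines of code.

Below are the time series forecasting examples I found in the official documentation.

[Machine learning]: https://auto.gluon.ai/stable/tutorials/timeseries/forecasting-indepth.html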

import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

This imports the libraries. Note that autogluon can be installed directly from PyCharm; for me, the installation failed while my VPN was on.

Time series with static features

Loading the data

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_daily_subset/train.csv",
)
train_data.head()

AutoGluon expects the static features as a pandas DataFrame whose index contains the item_ids of the time series.

static_features = pd.read_csv(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_daily_subset/metadata.csv",
    index_col="item_id",
)
static_features.head()

Attach the static features to the time series data frame:

train_data.static_features = static_features
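
Since the static features are matched to the series through the index, it can be worth checking that every item_id has a row before attaching them. Here is a minimal sketch using plain pandas; the helper name missing_static_features and the toy data are my own and not part of AutoGluon.

```python
import pandas as pd


def missing_static_features(item_ids, static_features):
    """Return the item_ids that have no row in the static features frame."""
    return pd.Index(item_ids).difference(static_features.index)


# Toy data (made up for illustration): three series, metadata for only two.
item_ids = ["D1", "D2", "D3"]
static_features = pd.DataFrame(
    {"domain": ["Industry", "Finance"]},
    index=pd.Index(["D1", "D2"], name="item_id"),
)

missing = missing_static_features(item_ids, static_features)
print(list(missing))
```

With the real data, the same check would take train_data.item_ids as the first argument.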

domain: a categorical feature


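predictor = TimeSeriesPredictor(prediction_length=14).fit(train_data)
# Abridged training log:
# ...
# Following types of static features have been inferred:
#     categorical: ['domain']
#     continuous (float): []
# ...

# Optional: an integer column such as store_id is read as continuous by default;
# casting it to the category dtype makes AutoGluon treat it as categorical.
#train_data.static_features["store_id"] = list(range(len(train_data.item_ids)))
#train_data.static_features["store_id"] = train_data.static_features["store_id"].astype("category")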
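
To make the dtype rule concrete, here is a rough sketch of how a column's dtype decides whether it counts as categorical or continuous. This is a simplified imitation written with pandas only, not AutoGluon's actual implementation; the function split_static_feature_types and the toy store_id column are assumptions for illustration.

```python
import pandas as pd


def split_static_feature_types(static_features):
    """Simplified rule: numeric columns are continuous, everything else categorical."""
    categorical, continuous = [], []
    for name, dtype in static_features.dtypes.items():
        if pd.api.types.is_numeric_dtype(dtype):
            continuous.append(name)
        else:
            categorical.append(name)
    return categorical, continuous


# Toy static features (made up): a string column and an integer id column.
toy = pd.DataFrame(
    {"domain": ["Industry", "Finance", "Macro"], "store_id": [0, 1, 2]},
    index=pd.Index(["D1", "D2", "D3"], name="item_id"),
)
before = split_static_feature_types(toy)

# Casting the integer column to "category" moves it to the categorical group.
toy["store_id"] = toy["store_id"].astype("category")
after = split_static_feature_types(toy)

print(before)
print(after)
```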

Time series with time-varying covariates

Generating the covariates

import numpy as np
train_data["log_target"] = np.log(train_data["target"])

WEEKEND_INDICES = [5, 6]
timestamps = train_data.index.get_level_values("timestamp")
train_data["weekend"] = timestamps.weekday.isin(WEEKEND_INDICES).astype(float)

train_data.head()

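Note the weekend column in the code: it flags Saturdays and Sundays, so it captures the weekly "seasonality" of the series. The log below is what appears when the predictor is created with known_covariates_names=["weekend"]; any other non-target column, here log_target, is treated as a past covariate.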

...
Provided dataset contains following columns:
target: 'target'
known covariates: ['weekend']
past covariates: ['log_target']
...

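Finally, to make a forecast, we generate the known covariates for the forecast horizon. They are passed to predict() through its known_covariates argument.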

from autogluon.timeseries.utils.forecast import get_forecast_horizon_index_ts_dataframe

future_index = get_forecast_horizon_index_ts_dataframe(train_data, prediction_length=14)
future_timestamps = future_index.get_level_values("timestamp")
known_covariates = pd.DataFrame(index=future_index)
known_covariates["weekend"] = future_timestamps.weekday.isin(WEEKEND_INDICES).astype(float)

known_covariates.head()
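
The helper imported above may not be available under the same name in every AutoGluon version, so it helps to see what it boils down to. The sketch below rebuilds the future index with plain pandas, assuming a daily frequency and a frame indexed by (item_id, timestamp); the function future_weekend_covariates and the toy history are my own illustration, not an AutoGluon API.

```python
import pandas as pd

WEEKEND_INDICES = [5, 6]  # Saturday and Sunday in pandas' weekday numbering


def future_weekend_covariates(history, prediction_length, freq="D"):
    """Build the weekend indicator for the steps right after each series ends."""
    rows = []
    for item_id, group in history.groupby(level="item_id"):
        last_timestamp = group.index.get_level_values("timestamp").max()
        # date_range starts at the last observed step, so drop its first entry.
        future = pd.date_range(last_timestamp, periods=prediction_length + 1, freq=freq)[1:]
        rows.extend((item_id, ts) for ts in future)
    index = pd.MultiIndex.from_tuples(rows, names=["item_id", "timestamp"])
    covariates = pd.DataFrame(index=index)
    weekday = index.get_level_values("timestamp").weekday
    covariates["weekend"] = weekday.isin(WEEKEND_INDICES).astype(float)
    return covariates


# Toy history (made up): series A ends on Wed 2022-01-05, series B on Fri 2022-01-07.
tuples = [("A", ts) for ts in pd.date_range("2022-01-01", "2022-01-05")]
tuples += [("B", ts) for ts in pd.date_range("2022-01-03", "2022-01-07")]
history = pd.DataFrame(
    {"target": [float(i) for i in range(10)]},
    index=pd.MultiIndex.from_tuples(tuples, names=["item_id", "timestamp"]),
)

future = future_weekend_covariates(history, prediction_length=3)
print(future)
```

The important point is that the known covariates must cover exactly the prediction_length steps following the end of each series.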

What data format is expected by TimeSeriesPredictor?

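  • The training data must contain at least some time series of length ≥ 2 * prediction_length + 1. This is needed so that some data can be set aside as an internal validation set.
  • All time series must be regularly sampled, i.e. have a regular time index.
  • The data must not contain missing values.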
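
The three requirements above can be checked by hand before fitting. Below is a minimal sketch with plain pandas, assuming a frame indexed by (item_id, timestamp) with sorted timestamps and a fixed step size (so it would not suit monthly data); check_forecasting_data is a hypothetical helper of mine, not something AutoGluon provides.

```python
import pandas as pd


def check_forecasting_data(df, prediction_length):
    """Check the three data requirements on a (item_id, timestamp)-indexed frame."""
    lengths = df.groupby(level="item_id").size()
    steps = set()
    for _, group in df.groupby(level="item_id"):
        ts = group.index.get_level_values("timestamp")
        steps.update(ts[1:] - ts[:-1])  # gaps between consecutive timestamps
    return {
        "has_long_enough_series": bool((lengths >= 2 * prediction_length + 1).any()),
        "regular_index": len(steps) <= 1,
        "no_missing_values": bool(df.notna().all().all()),
    }


# Toy data (made up): both series skip some days, so the index is irregular.
toy = pd.DataFrame(
    {
        "item_id": [0, 0, 0, 1, 1],
        "timestamp": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-01", "2022-01-04"]
        ),
        "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
).set_index(["item_id", "timestamp"])

report = check_forecasting_data(toy, prediction_length=1)
print(report)
```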

How to handle irregular data and missing values

Here is an example of a dataset with an irregular time index:

df_irregular = TimeSeriesDataFrame(
    pd.DataFrame(
        {
            "item_id": [0, 0, 0, 1, 1],
            "timestamp": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-01", "2022-01-04"],
            "target": [1, 2, 3, 4, 5],
        }
    )
)
df_irregular

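Some timestamps are clearly missing: item 0 has no value for 2022-01-03, and item 1 has none for 2022-01-02 and 2022-01-03.

After converting the data to a regular daily index, we can verify that the index is now regular and has a daily frequency.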

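# to_regular_index() is the method in the AutoGluon version this post was written against;
# newer releases offer convert_frequency(freq="D") for the same purpose.
df_regular = df_irregular.to_regular_index(freq="D")
print(f"Data has frequency '{df_regular.freq}'")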

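The result is D, meaning the timestamps follow a regular daily frequency.

However, the data now contains missing values, represented by NaN. To fill them, we use the TimeSeriesDataFrame.fill_missing_values() method, which supports several imputation strategies.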

df_filled = df_regular.fill_missing_values()
df_filled
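
To see what the two steps do to the numbers, here is a pandas-only sketch that puts each series on a daily grid and then fills the gaps forward and, for any leading gaps, backward. As far as I understand this is close to what the default strategy of fill_missing_values() does, but treat that as an assumption and check the documentation; regularize_and_fill is my own helper name.

```python
import pandas as pd


def regularize_and_fill(df, freq="D"):
    """Reindex each item onto a regular grid, then forward/backward fill the gaps."""
    pieces = []
    for item_id, group in df.groupby("item_id"):
        series = group.set_index("timestamp")["target"]
        full_range = pd.date_range(series.index.min(), series.index.max(), freq=freq)
        series = series.reindex(full_range)  # missing days become NaN
        series = series.ffill().bfill()      # carry the last known value forward
        pieces.append(
            pd.DataFrame(
                {"item_id": item_id, "timestamp": full_range, "target": series.to_numpy()}
            )
        )
    return pd.concat(pieces, ignore_index=True)


# Same toy values as the irregular example above.
raw = pd.DataFrame(
    {
        "item_id": [0, 0, 0, 1, 1],
        "timestamp": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-01", "2022-01-04"]
        ),
        "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

filled = regularize_and_fill(raw)
print(filled)
```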

How to evaluate forecast accuracy?

How the dataset is split matters:

prediction_length = 48
data = TimeSeriesDataFrame.from_path("https://autogluon.s3.amazonaws.com/datasets/timeseries/m4_hourly_subset/train.csv")
train_data, test_data = data.train_test_split(prediction_length)
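
The idea behind this split is that the training set drops the last prediction_length steps of every series, while the test set keeps the full series (past plus future). A minimal sketch of that logic with plain pandas follows; hold_out_last and the toy data are assumptions of mine and do not reproduce AutoGluon's code.

```python
import pandas as pd


def hold_out_last(df, prediction_length):
    """Train = each series without its last steps; test = the complete series."""
    # cumcount(ascending=False) counts rows from the end of each series: 0 is the last row.
    steps_from_end = df.groupby(level="item_id").cumcount(ascending=False)
    train = df[(steps_from_end >= prediction_length).to_numpy()]
    return train, df


# Toy data (made up): two daily series with five observations each.
index = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2022-01-01", periods=5, freq="D")],
    names=["item_id", "timestamp"],
)
data = pd.DataFrame({"target": [float(i) for i in range(10)]}, index=index)

train, test = hold_out_last(data, prediction_length=2)
print(len(train), len(test))
```

Evaluating on the held-out tail of each series respects time order, which a random row-wise split would not.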

Plotting the split

import matplotlib.pyplot as plt
import numpy as np

item_id = "H1"
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=[10, 4], sharex=True)
train_ts = train_data.loc[item_id]
test_ts = test_data.loc[item_id]
ax1.set_title("Train data (past time series values)")
ax1.plot(train_ts)
ax2.set_title("Test data (past + future time series values)")
ax2.plot(test_ts)
for ax in (ax1, ax2):
    # Shade the forecast horizon: the part of the test series the model never saw.
    ax.fill_between(
        np.array([train_ts.index[-1], test_ts.index[-1]]),
        test_ts.min(),
        test_ts.max(),
        color="C1",
        alpha=0.3,
        label="Forecast horizon",
    )
plt.legend()
plt.show()

Evaluating the model

predictor = TimeSeriesPredictor(prediction_length=prediction_length, eval_metric="MASE").fit(train_data)
predictor.evaluate(test_data)
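
For reference, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast that repeats the value from seasonal_period steps earlier. The sketch below computes it for a single series with NumPy; the function and the numbers are illustrative only, and AutoGluon's own computation (for example how it aggregates over many series) may differ in detail.

```python
import numpy as np


def mase(y_past, y_true, y_pred, seasonal_period=1):
    """Mean absolute scaled error of one forecast against a seasonal naive baseline."""
    y_past = np.asarray(y_past, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # In-sample error of the naive forecast "same value as seasonal_period steps ago".
    naive_error = np.mean(np.abs(y_past[seasonal_period:] - y_past[:-seasonal_period]))
    forecast_error = np.mean(np.abs(y_true - y_pred))
    return forecast_error / naive_error


# Made-up numbers: the history rises by 1 per step, so the naive error is 1.
score = mase(y_past=[1, 2, 3, 4, 5], y_true=[6, 7], y_pred=[5, 9])
print(score)
```

A value below 1 means the forecast beats the naive baseline on average. As far as I know, AutoGluon reports scores in a higher-is-better form, so an error metric like MASE may come back with its sign flipped.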
