Notice

Recent Posts

Recent Comments

Link

Tags more

Archives

Today

Total

관리 메뉴

MoonNote

sktime (time series 라이브러리) 본문

Study/Machine Learning

sktime (time series 라이브러리)

Kisung Moon 2021. 11. 5. 10:46

Sktime 이란?¶

파이썬의 시계열 분석을 위한 라이브러리
scikit-learn과 호환되며 다양한 시계열 알고리즘 및 도구 제공
https://opensourcelibs.com/libs/time-series-classification

Install¶

pip install sktime
conda install -c conda-forge sktime

Task¶

Forecasting (Regression)
Classification

Dataset¶

각 task에 적합한 총 18개의 데이터셋을 제공한다.

univariate for forecasting
multivariate for forecasting
univariate for classification
multivariate for classification

In [1]:

from sktime.datasets import (
    load_airline,              # 1 univariate forecasting
    load_PBS_dataset,          # 1 univariate forecasting
    load_shampoo_sales,        # 1 univariate forecasting
    load_lynx,                 # 1 univariate forecasting
    
    load_macroeconomic,        # 2 multivariate forecasting
    load_longley,              # 2 multivariate forecasting
    load_uschange,             # 2 multivariate forecasting
    
    load_acsf1,                # 3 univariate classification
    load_arrow_head,           # 3 univariate classification
    load_gunpoint,             # 3 univariate classification
    load_italy_power_demand,   # 3 univariate classification
    load_osuleaf,              # 3 univariate classification
    load_unit_test,            # 3 univariate classification
    
    load_japanese_vowels,      # 4 multivariate classification
    load_basic_motions,        # 4 multivariate classification
    
    #load_electric_devices_segmentation, 
    #load_gun_point_segmentation,        
    #load_UCR_UEA_dataset
)

Forecasting¶

univariate forecasting
multivariate forecasting

In [2]:

#from sktime.forecasting.all import *

Univariate Forecasting¶

데이터셋 불러오기
데이터셋 나누기
Forecasting 실행
시각화
성능 측정

데이터셋 불러오기¶

In [3]:

y = load_airline()
y.head()

Out[3]:

Period
1949-01    112.0
1949-02    118.0
1949-03    132.0
1949-04    129.0
1949-05    121.0
Freq: M, Name: Number of airline passengers, dtype: float64

In [4]:

y.index

Out[4]:

PeriodIndex(['1949-01', '1949-02', '1949-03', '1949-04', '1949-05', '1949-06',
             '1949-07', '1949-08', '1949-09', '1949-10',
             ...
             '1960-03', '1960-04', '1960-05', '1960-06', '1960-07', '1960-08',
             '1960-09', '1960-10', '1960-11', '1960-12'],
            dtype='period[M]', name='Period', length=144, freq='M')

데이터셋 나누기¶

In [5]:

from sktime.forecasting.model_selection import temporal_train_test_split

y_train, y_test = temporal_train_test_split(y, test_size=36)

데이터셋 시각화¶

In [6]:

from sktime.utils.plotting import plot_series

plot_series(y_train, y_test, labels=["y", "y_test"])

Out[6]:

(<Figure size 1152x288 with 1 Axes>,
 <AxesSubplot:ylabel='Number of airline passengers'>)

Forecasting 실행¶

In [7]:

from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.naive import NaiveForecaster

# 예측 범위 지정
fh = ForecastingHorizon(y_test.index, is_relative=False)
# 모델 설정
forecaster = NaiveForecaster(strategy="last", sp=12)
forecaster.fit(y_train)
# 예측
y_pred = forecaster.predict(fh)

In [8]:

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

# step 1: 데이터 불러오기 및 나누기
y = load_airline()
y_train, y_test = temporal_train_test_split(y, test_size=36)

# step 2: forecasting 실행
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = NaiveForecaster(strategy="last", sp=12)
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)

# 시각화
plot_series(y_train, y_test, y_pred, labels=["y_train", "y_test", "y_pred"])

# step 3: evaluation metric 지정
# step 4: 성능 측정
mean_absolute_percentage_error(y_test, y_pred)

Out[8]:

0.145427686270316

모델 변경¶

sktime은 다양한 통계적 forecasting algorithm들을 제공한다.
https://github.com/alan-turing-institute/sktime/blob/922cd71d0d82d849025a080be826cf1c3c4777e5/sktime/forecasting/all/__init__.py#L88

In [9]:

from sktime.registry import all_estimators
import pandas as pd

all_estimators("forecaster", as_dataframe=True)

Out[9]:

	name	estimator
0	ARIMA	<class 'sktime.forecasting.arima.ARIMA'>
1	AutoARIMA	<class 'sktime.forecasting.arima.AutoARIMA'>
2	AutoETS	<class 'sktime.forecasting.ets.AutoETS'>
3	AutoEnsembleForecaster	<class 'sktime.forecasting.compose._ensemble.A...
4	BATS	<class 'sktime.forecasting.bats.BATS'>
5	ColumnEnsembleForecaster	<class 'sktime.forecasting.compose._column_ens...
6	Croston	<class 'sktime.forecasting.croston.Croston'>
7	DirRecTabularRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Dir...
8	DirRecTimeSeriesRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Dir...
9	DirectTabularRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Dir...
10	DirectTimeSeriesRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Dir...
11	EnsembleForecaster	<class 'sktime.forecasting.compose._ensemble.E...
12	ExponentialSmoothing	<class 'sktime.forecasting.exp_smoothing.Expon...
13	ForecastingGridSearchCV	<class 'sktime.forecasting.model_selection._tu...
14	ForecastingPipeline	<class 'sktime.forecasting.compose._pipeline.F...
15	ForecastingRandomizedSearchCV	<class 'sktime.forecasting.model_selection._tu...
16	HCrystalBallForecaster	<class 'sktime.forecasting.hcrystalball.HCryst...
17	MultioutputTabularRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Mul...
18	MultioutputTimeSeriesRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Mul...
19	MultiplexForecaster	<class 'sktime.forecasting.compose._multiplexe...
20	NaiveForecaster	<class 'sktime.forecasting.naive.NaiveForecast...
21	OnlineEnsembleForecaster	<class 'sktime.forecasting.online_learning._on...
22	PolynomialTrendForecaster	<class 'sktime.forecasting.trend.PolynomialTre...
23	Prophet	<class 'sktime.forecasting.fbprophet.Prophet'>
24	RecursiveTabularRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Rec...
25	RecursiveTimeSeriesRegressionForecaster	<class 'sktime.forecasting.compose._reduce.Rec...
26	StackingForecaster	<class 'sktime.forecasting.compose._stack.Stac...
27	TBATS	<class 'sktime.forecasting.tbats.TBATS'>
28	ThetaForecaster	<class 'sktime.forecasting.theta.ThetaForecast...
29	TransformedTargetForecaster	<class 'sktime.forecasting.compose._pipeline.T...
30	TrendForecaster	<class 'sktime.forecasting.trend.TrendForecast...
31	UnobservedComponents	<class 'sktime.forecasting.structural.Unobserv...
32	VAR	<class 'sktime.forecasting.var.VAR'>

In [10]:

from sktime.forecasting.exp_smoothing import ExponentialSmoothing

y = load_airline()
y_train, y_test = temporal_train_test_split(y, test_size=36)

fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = ExponentialSmoothing(trend="add", seasonal="additive", sp=12)
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)

plot_series(y_train, y_test, y_pred, labels=["y_train", "y_test", "y_pred"])

mean_absolute_percentage_error(y_test, y_pred)

Out[10]:

0.05027655720606656

In [11]:

from sktime.forecasting.arima import ARIMA

forecaster = ARIMA(
    order=(1, 1, 0), seasonal_order=(0, 1, 0, 12), suppress_warnings=True
)

forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
plot_series(y_train, y_test, y_pred, labels=["y_train", "y_test", "y_pred"])
mean_absolute_percentage_error(y_pred, y_test)

Out[11]:

0.04257105757347649

Multivariate Forecasting¶

데이터셋 불러오기
데이터셋 나누기
Forecasting 실행
시각화
성능 측정

In [12]:

_, y = load_longley()
y.head()

Out[12]:

	GNPDEFL	GNP	UNEMP	ARMED	POP
Period
1947	83.0	234289.0	2356.0	1590.0	107608.0
1948	88.5	259426.0	2325.0	1456.0	108632.0
1949	88.2	258054.0	3682.0	1616.0	109773.0
1950	89.5	284599.0	3351.0	1650.0	110929.0
1951	96.2	328975.0	2099.0	3099.0	112075.0

In [13]:

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

# step 1: 데이터 불러오기 및 나누기
_, y = load_longley()
y_train, y_test = temporal_train_test_split(y)

# step 2: forecasting 실행
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = NaiveForecaster(strategy="last")
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)

# 시각화
plot_series(y_train["POP"], y_test["POP"], y_pred["POP"], labels=["y_train", "y_test", "y_pred"])

# step 3: evaluation metric 지정
# step 4: 성능 측정
mean_absolute_percentage_error(y_test, y_pred)

Out[13]:

0.08039499198190428

In [74]:

import warnings
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

from sktime.datasets import load_shampoo_sales, load_italy_power_demand
from sktime.forecasting.compose import RecursiveTimeSeriesRegressionForecaster
from sktime.forecasting.model_selection import temporal_train_test_split

sns.set_style('whitegrid')

In [14]:

from sktime.forecasting.var import VAR

# step 1: 데이터 불러오기 및 나누기
_, y = load_longley()
y_train, y_test = temporal_train_test_split(y)

# step 2: forecasting 실행
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = VAR()
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)

# 시각화
plot_series(y_train["POP"], y_test["POP"], y_pred["POP"], labels=["y_train", "y_test", "y_pred"])

# step 3: evaluation metric 지정
# step 4: 성능 측정
mean_absolute_percentage_error(y_test, y_pred)

Out[14]:

0.08482383879246463

Univariate time series classification¶

In [46]:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

from sktime.classification.compose import ComposableTimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sktime.utils.slope_and_trend import _slope

In [47]:

X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)

In [48]:

# univariate 데이터
X_train.head()

Out[48]:

	dim_0
122	0 -1.6961 1 -1.6806 2 -1.6574 3 ...
32	0 -1.6737 1 -1.6715 2 -1.6602 3 ...
142	0 -1.8981 1 -1.8790 2 -1.8566 3 ...
30	0 -1.9204 1 -1.9015 2 -1.8864 3 ...
73	0 -1.8132 1 -1.8255 2 -1.8166 3 ...

In [49]:

# target variable
labels, counts = np.unique(y_train, return_counts=True)
print(labels, counts)

['0' '1' '2'] [58 49 51]

In [50]:

fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
for label in labels:
    X_train.loc[y_train == label, "dim_0"].iloc[0].plot(ax=ax, label=f"class {label}")
plt.legend()
ax.set(title="Example time series", xlabel="Time");

Scikit-learn 방식¶

time point에 해당하는 값들을 feature로 치환

In [51]:

from sklearn.ensemble import RandomForestClassifier

from sktime.datatypes._panel._convert import from_nested_to_2d_array

X_train_tab = from_nested_to_2d_array(X_train)
X_test_tab = from_nested_to_2d_array(X_test)

X_train_tab.head()

Out[51]:

	dim_0__0	dim_0__1	dim_0__2	dim_0__3	dim_0__4	dim_0__5	dim_0__6	dim_0__7	dim_0__8	dim_0__9	...	dim_0__241	dim_0__242	dim_0__243	dim_0__244	dim_0__245	dim_0__246	dim_0__247	dim_0__248	dim_0__249	dim_0__250
122	-1.6961	-1.6806	-1.6574	-1.6443	-1.6187	-1.5873	-1.5372	-1.5189	-1.4789	-1.4287	...	-1.4995	-1.5552	-1.5921	-1.6109	-1.6262	-1.6402	-1.6658	-1.6798	-1.6816	-1.6834
32	-1.6737	-1.6715	-1.6602	-1.6349	-1.6061	-1.5588	-1.5562	-1.5173	-1.4901	-1.4263	...	-1.4081	-1.4331	-1.4963	-1.5221	-1.5602	-1.5768	-1.6097	-1.6362	-1.6612	-1.6625
142	-1.8981	-1.8790	-1.8566	-1.8160	-1.8048	-1.7729	-1.7545	-1.7027	-1.6606	-1.6129	...	-1.6285	-1.6869	-1.7297	-1.7631	-1.7927	-1.8192	-1.8330	-1.8704	-1.8827	-1.8985
30	-1.9204	-1.9015	-1.8864	-1.8678	-1.8133	-1.7729	-1.7501	-1.7205	-1.6654	-1.6369	...	-1.5643	-1.6283	-1.6402	-1.6773	-1.7094	-1.7512	-1.7945	-1.8678	-1.9019	-1.9039
73	-1.8132	-1.8255	-1.8166	-1.8025	-1.7866	-1.7659	-1.7616	-1.7547	-1.7455	-1.7145	...	-1.2668	-1.3390	-1.4362	-1.5041	-1.5512	-1.6177	-1.6687	-1.7403	-1.7732	-1.8038

5 rows × 251 columns

In [52]:

classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X_train_tab, y_train)
y_pred = classifier.predict(X_test_tab)
accuracy_score(y_test, y_pred)

Out[52]:

0.8490566037735849

Feature extraction¶

시계열 데이터에서 특징을 추출한 후 이를 활용

In [53]:

from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

transformer = TSFreshFeatureExtractor(default_fc_parameters="minimal")
extracted_features = transformer.fit_transform(X_train)
extracted_features.head()

/usr/local/lib/python3.6/dist-packages/sktime/transformations/panel/tsfresh.py:164: UserWarning:

tsfresh requires a unique index, but found non-unique. To avoid this warning, please make sure the index of X contains only unique values.

Feature Extraction: 100%|██████████| 5/5 [00:00<00:00, 32.56it/s]

Out[53]:

	dim_0__sum_values	dim_0__median	dim_0__mean	dim_0__length	dim_0__standard_deviation	dim_0__variance	dim_0__root_mean_square	dim_0__maximum	dim_0__minimum
0	0.000197	0.218390	7.848606e-07	251.0	0.998006	0.996017	0.998006	1.2427	-1.6961
1	-0.000356	0.312720	-1.418327e-06	251.0	0.998003	0.996011	0.998003	1.1377	-1.6737
2	0.000279	-0.020420	1.111554e-06	251.0	0.998007	0.996018	0.998007	1.3738	-1.8985
3	0.000071	-0.166200	2.828685e-07	251.0	0.998009	0.996021	0.998009	1.5740	-1.9204
4	0.000015	-0.020305	5.976096e-08	251.0	0.998005	0.996013	0.998005	1.3624	-1.8255

In [54]:

from sklearn.pipeline import make_pipeline

classifier = make_pipeline(
    TSFreshFeatureExtractor(show_warnings=False), RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

/usr/local/lib/python3.6/dist-packages/sktime/transformations/panel/tsfresh.py:164: UserWarning:

tsfresh requires a unique index, but found non-unique. To avoid this warning, please make sure the index of X contains only unique values.

Feature Extraction: 100%|██████████| 5/5 [00:09<00:00,  1.95s/it]
/usr/local/lib/python3.6/dist-packages/sktime/transformations/panel/tsfresh.py:164: UserWarning:

tsfresh requires a unique index, but found non-unique. To avoid this warning, please make sure the index of X contains only unique values.

Feature Extraction: 100%|██████████| 5/5 [00:03<00:00,  1.51it/s]

Out[54]:

0.8679245283018868

Time series classification¶

Time series forest : 랜덤 포레스트의 시계열 버전

데이터를 여러 개의 random한 구간으로 분할한다.
각 구간에서 특징(평균, 표준편차, 기울기)을 추출하고,
추출된 특징에 대해 학습한다.
1 - 3 step 앙상블

In [55]:

from sktime.transformations.panel.summarize import RandomIntervalFeatureExtractor

steps = [
    (
        "extract",
        RandomIntervalFeatureExtractor(
            n_intervals="sqrt", features=[np.mean, np.std, _slope]
        ),
    ),
    ("clf", DecisionTreeClassifier()),
]
time_series_tree = Pipeline(steps)

In [56]:

time_series_tree.fit(X_train, y_train)
time_series_tree.score(X_test, y_test)

Out[56]:

0.7358490566037735

In [57]:

tsf = ComposableTimeSeriesForestClassifier(
    estimator=time_series_tree,
    n_estimators=100,
    criterion="entropy",
    bootstrap=True,
    oob_score=True,
    random_state=1
)

In [58]:

tsf.fit(X_train, y_train)

if tsf.oob_score:
    print(tsf.oob_score_)

0.8481012658227848

In [59]:

tsf = ComposableTimeSeriesForestClassifier()
tsf.fit(X_train, y_train)
tsf.score(X_test, y_test)

Out[59]:

0.8679245283018868

Feature 중요도¶

In [31]:

fi = tsf.feature_importances_
# renaming _slope to slope.
fi.rename(columns={"_slope": "slope"}, inplace=True)
fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
fi.plot(ax=ax)
ax.set(xlabel="Time", ylabel="Feature importance");

Multivariate time series classification¶

Time series concatenation
Column ensembling
Bespoke classification algorithms

In [32]:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.dictionary_based import BOSSEnsemble
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.classification.shapelet_based import MrSEQLClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator

In [33]:

X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)

In [34]:

#  multivariate input data
X_train.head()

Out[34]:

	dim_0	dim_1	dim_2	dim_3	dim_4	dim_5
9	0 -0.407421 1 -0.407421 2 2.355158 3...	0 1.413374 1 1.413374 2 -3.928032 3...	0 0.092782 1 0.092782 2 -0.211622 3...	0 -0.066584 1 -0.066584 2 -3.630177 3...	0 0.223723 1 0.223723 2 -0.026634 3...	0 0.135832 1 0.135832 2 -1.946925 3...
24	0 0.383922 1 0.383922 2 -0.272575 3...	0 0.302612 1 0.302612 2 -1.381236 3...	0 -0.398075 1 -0.398075 2 -0.681258 3...	0 0.071911 1 0.071911 2 -0.761725 3...	0 0.175783 1 0.175783 2 -0.114525 3...	0 -0.087891 1 -0.087891 2 -0.503377 3...
5	0 -0.357300 1 -0.357300 2 -0.005055 3...	0 -0.584885 1 -0.584885 2 0.295037 3...	0 -0.792751 1 -0.792751 2 0.213664 3...	0 0.074574 1 0.074574 2 -0.157139 3...	0 0.159802 1 0.159802 2 -0.306288 3...	0 0.023970 1 0.023970 2 1.230478 3...
7	0 -0.352746 1 -0.352746 2 -1.354561 3...	0 0.316845 1 0.316845 2 0.490525 3...	0 -0.473779 1 -0.473779 2 1.454261 3...	0 -0.327595 1 -0.327595 2 -0.269001 3...	0 0.106535 1 0.106535 2 0.021307 3...	0 0.197090 1 0.197090 2 0.460763 3...
34	0 0.052231 1 0.052231 2 -0.54804...	0 -0.730486 1 -0.730486 2 0.70700...	0 -0.518104 1 -0.518104 2 -1.179430 3...	0 -0.159802 1 -0.159802 2 -0.239704 3...	0 -0.045277 1 -0.045277 2 0.023970 3...	0 -0.029297 1 -0.029297 2 0.29829...

In [35]:

# multi-class target variable
np.unique(y_train)

Out[35]:

array(['badminton', 'running', 'standing', 'walking'], dtype=object)

In [36]:

# step 1: 데이터 불러오기 및 나누기
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Time series concatenation¶

multivariate 데이터를 긴 univariate data로 변환하여 univariate의 분류기 적용

In [37]:

steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Out[37]:

1.0

Column ensembling¶

각 시계열 열에 대해 예측하는 모델들을 앙상블

In [38]:

clf = ColumnEnsembleClassifier(
    estimators=[
        ("TSF0", TimeSeriesForestClassifier(n_estimators=100), [0]),
        ("BOSSEnsemble3", BOSSEnsemble(max_ensemble_size=5), [3]),
    ]
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Out[38]:

0.95

Bespoke classification algorithms¶

In [39]:

clf = MrSEQLClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Out[39]:

1.0

In [ ]:

'Study > Machine Learning' 카테고리의 다른 글

Pretext task (0)	2021.12.01
Shapley Value (0)	2021.11.29
딥러닝에서 비선형 활성화함수를 쓰는 이유? (0)	2021.11.11
Model Assessment and Selection (0)	2021.10.23
Multilabel classification (0)	2021.05.05

'Study/Machine Learning' Related Articles

Comments

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31