Household Electric Power Consumption Prediction

Household electric power consumption prediction using LSTM

An LSTM can be used in several input/output configurations.
This tutorial uses the many-to-one configuration: a sequence of inputs is mapped to a single output value.
The dataset used in this tutorial is referenced below.
Individual household electric power consumption Data Set
Georges Hébrail, Senior Researcher, EDF R&D, Clamart, France
Alice Bérard, TELECOM ParisTech Master of Engineering Internship at EDF R&D, Clamart, France

Required Libraries

  • matplotlib 2.0.2
  • numpy 1.12.1
  • scikit-learn 0.18.2
  • pandas 0.20.3
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error
import renom as rm
from renom.optimizer import Adam
from renom.cuda import set_cuda_active

Create dataset for training and prediction

We use the state of the previous 30 minutes to predict the total electric power consumption over a 30-minute period that starts 60 minutes later.

In [2]:
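def create_dataset(data, look_back, period, blank):
    # Build sliding windows: each sample holds look_back consecutive rows of features.
    X, y = [], []
    for i in range(len(data)-look_back-period-blank):
        X.append(data[i : i+look_back, :])
        # The target window starts blank rows after the input window ends.
        # [0] selects the first row of that window, so the label is the sum of that row's values.
        watsum = sum(list(map(float,data[i+blank+look_back : i+look_back+blank+period][0])))
        y.append(watsum)
    n_features = np.array(X).shape[2]
    X = np.reshape(np.array(X), [-1, look_back, n_features])
    y = np.reshape(np.array(y), [-1, 1])
    return X, y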
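
To make the indexing concrete, the sketch below computes which rows a single sample covers. The helpers `window_bounds` and `n_samples` are hypothetical names introduced only for illustration; they mirror the slice arithmetic and loop range in `create_dataset` and are not part of the original notebook.

```python
def window_bounds(i, look_back, blank, period):
    """Return (input_start, input_end, target_start, target_end) for sample i.

    End indices are exclusive, matching Python slice semantics.
    """
    input_start = i
    input_end = i + look_back
    target_start = input_end + blank
    target_end = target_start + period
    return input_start, input_end, target_start, target_end


def n_samples(n_rows, look_back, blank, period):
    """Number of samples produced by the loop range in create_dataset."""
    return max(0, n_rows - look_back - period - blank)


bounds = window_bounds(0, look_back=30, blank=60, period=30)
print(bounds)                      # (0, 30, 90, 120)
print(n_samples(200, 30, 60, 30))  # 80
```

With the settings used in this tutorial, the first sample reads rows 0 to 29 and its target window covers rows 90 to 119.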

Split the data

In [3]:
def split_data(X, y, test_size=0.1):
    pos = int(round(len(X) * (1-test_size)))
    X_train, y_train = X[:pos], y[:pos]
    X_test, y_test = X[pos:], y[pos:]
    return X_train, y_train, X_test, y_test
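
The split is chronological: nothing is shuffled, so every test sample comes later in time than every training sample. The sketch below repeats the same logic in condensed form on toy data whose values equal their time index; the toy arrays are assumptions made for illustration.

```python
import numpy as np


def split_data(X, y, test_size=0.1):
    # Same logic as the function above: the first part is used for training,
    # the remainder (later in time) for testing.
    pos = int(round(len(X) * (1 - test_size)))
    return X[:pos], y[:pos], X[pos:], y[pos:]


# Toy data: ten samples whose single feature is the time index.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10).reshape(-1, 1)

X_tr, y_tr, X_te, y_te = split_data(X_toy, y_toy, test_size=0.3)
print(X_tr.ravel(), X_te.ravel())
```

Keeping the order intact avoids training on data from the future relative to the test period.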

Load the data from the text file and drop rows with missing values

In [4]:
filename = "household_power_consumption.txt"
df = pd.read_csv(filename,sep=";", usecols=[2,3,4,5,6,7,8], low_memory=False)
print("the number of {} records:{}\n".format(filename, len(df.index)))
df = df.applymap(lambda d: np.nan if d=="?" else d)
print("missing value info:\n{}\n".format(df.isnull().sum(axis=0)))
df = df.dropna(axis=0)
print("the number of {} records after trimming:{}\n".format(filename, len(df.index)))

ds = df.values.astype("float32")
the number of household_power_consumption.txt records:2075259

missing value info:
Global_active_power      25979
Global_reactive_power    25979
Voltage                  25979
Global_intensity         25979
Sub_metering_1           25979
Sub_metering_2           25979
Sub_metering_3           25979
dtype: int64

the number of household_power_consumption.txt records after trimming:2049280
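
As an alternative to replacing "?" after loading, `pandas.read_csv` accepts an `na_values` argument that marks given strings as missing while parsing. The sketch below applies it to a made-up three-row sample in the same ";"-separated layout; the sample rows and the reduced column set are assumptions for illustration, not the real file contents.

```python
import io

import pandas as pd

# Tiny made-up sample; "?" marks a missing measurement.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# na_values="?" turns the marker into NaN, so the columns parse as floats.
df_sample = pd.read_csv(sample, sep=";", usecols=[2, 3], na_values="?")
n_missing = int(df_sample.isnull().any(axis=1).sum())

df_clean = df_sample.dropna(axis=0)
values = df_clean.values.astype("float32")
print(n_missing, values.shape)
```

Parsing the marker at load time keeps the columns numeric from the start, which avoids a separate element-wise pass over two million rows.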

Preprocessing and define the model

First, normalize the data to the range 0 to 1.
You would usually use scikit-learn's MinMaxScaler, but here we define our own functions so that the original scale can be restored easily.
minmaxScaler normalizes the data, and undoScaler restores the original scale.
look_back is the number of time steps used as input, period is the number of time steps covered by the prediction target, and blank is the gap between the two.
We use 30 minutes of input to predict 30 minutes of electric power consumption after a gap of 60 minutes,
so look_back is set to 30, period to 30, and blank to 60.
Because the full dataset takes too long to process, we restrict the data to the first 100000 samples.
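
The scaling and its inverse can be sketched in vectorized NumPy as follows. The names `scale` and `unscale` are hypothetical and the toy matrix is made up; unlike the notebook's functions, this sketch returns new arrays instead of modifying its input.

```python
import numpy as np


def scale(data, max_per_feature, min_per_feature):
    """Min-max scale the last axis to [0, 1]; returns a new array."""
    span = max_per_feature - min_per_feature
    safe_span = np.where(span == 0, 1.0, span)  # avoid division by zero
    scaled = (data - min_per_feature) / safe_span
    # Constant features are mapped to 1, the same convention as minmaxScaler.
    return np.where(span == 0, 1.0, scaled)


def unscale(scaled, max_per_feature, min_per_feature):
    """Inverse of scale: map [0, 1] back to the original range."""
    return scaled * (max_per_feature - min_per_feature) + min_per_feature


# Three samples, three features; the last feature is constant.
data = np.array([[1.0, 10.0, 5.0],
                 [3.0, 20.0, 5.0],
                 [2.0, 40.0, 5.0]])
max_f = data.max(axis=0)
min_f = data.min(axis=0)

scaled = scale(data, max_f, min_f)
restored = unscale(scaled, max_f, min_f)
print(scaled)
print(np.allclose(restored, data))
```

Keeping the per-feature minimum and maximum around is what makes the round trip possible, which is the reason the tutorial stores them in separate lists.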
In [5]:
def minmaxScaler(data, maxlist, minlist):
    # Scale each feature (last axis) to [0, 1] in place.
    for i in range(data.shape[-1]):
        if maxlist[i] - minlist[i] == 0:
            # Constant feature: avoid division by zero.
            data[..., i] = 1
        else:
            data[..., i] = (data[..., i] - minlist[i]) / (maxlist[i] - minlist[i])
    return data

def undoScaler(data, maxlist, minlist):
    # Map scaled values back to the original range in place.
    for i in range(data.shape[-1]):
        if maxlist[i] - minlist[i] == 0:
            data[..., i] = maxlist[i] * 1
        else:
            data[..., i] = data[..., i] * (maxlist[i] - minlist[i]) + minlist[i]
    return data

look_back = 30
blank = 60
period = 30
X, y = create_dataset(ds, look_back, period, blank)
X, y = X[:100000, :, :], y[:100000, :]
maxlist_data = np.max(X.reshape(X.shape[0]*X.shape[1], X.shape[2]), axis=0)
minlist_data = np.min(X.reshape(X.shape[0]*X.shape[1], X.shape[2]), axis=0)
maxlist_label = np.max(y).reshape(-1,1)
minlist_label = np.min(y).reshape(-1,1)
plt.title("electric power consumption for 30 minutes")
X = minmaxScaler(X, maxlist_data, minlist_data)
y = minmaxScaler(y, maxlist_label, minlist_label)
X_train, y_train, X_test, y_test = split_data(X, y, 0.33)
print("X_train:{},y_train:{},X_test:{},y_test:{}".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))

sequential = rm.Sequential([
    # (the layer definitions are missing from the source)
])

batch_size = 2048
epoch = 30
N = len(X_train)
T = X_train.shape[1]
X_train:(67000, 30, 7),y_train:(67000, 1),X_test:(33000, 30, 7),y_test:(33000, 1)
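
Given the training-set size printed above and `batch_size = 2048`, integer division determines how many full batches one epoch contains. The sketch below works through that arithmetic; `batch_slices` is a hypothetical helper, shown only as one possible way to also include the final partial batch.

```python
N = 67000          # training samples, taken from the printed X_train shape
batch_size = 2048

n_batches = N // batch_size             # full batches per epoch
n_skipped = N - n_batches * batch_size  # samples left over after the last full batch
print(n_batches, n_skipped)             # 32 1464


def batch_slices(n, size):
    """Yield slices covering all n samples, including a final partial batch."""
    for start in range(0, n, size):
        yield slice(start, min(start + size, n))


slices = list(batch_slices(N, batch_size))
print(len(slices), slices[-1])
```

A loop over `N // batch_size` therefore leaves the last 1464 training samples unused in each epoch.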

Training loop

In [6]:
learning_curve = []
test_learning_curve = []
optimizer = Adam(lr=0.01)
for i in range(epoch):
    train_loss = 0
    test_loss = 0
    for j in range(N//batch_size):
        train_batch = X_train[j*batch_size : (j+1)*batch_size]
        response_batch = y_train[j*batch_size : (j+1)*batch_size]
        l = 0
        z = 0
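        with sequential.train():
            # Feed the sequence one time step at a time; the LSTM keeps its state between calls.
            for t in range(T):
                z = sequential(train_batch[:, t, :])
                l += rm.mse(z, response_batch)
            l /= T
        # Backpropagate the loss and update the weights.
        l.grad().update(optimizer)
        # Reset the recurrent state before the next batch.
        sequential.truncate()
        train_loss += l.as_ndarray()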
    train_loss = train_loss / (N // batch_size)
    l_test = 0
    z = 0
    for t in range(T):
        z = sequential(X_test[:, t, :])
        l_test += rm.mse(z, y_test)
    l_test /= T
    # Reset the recurrent state after evaluating the test sequences.
    sequential.truncate()
    test_loss = l_test.as_ndarray()
    print("epoch:{} train loss:{} test loss:{}".format(i, train_loss, test_loss))
epoch:0 train loss:0.009572499962814618 test loss:0.0067715514451265335
epoch:1 train loss:0.005639389630232472 test loss:0.005541081074625254
epoch:2 train loss:0.0051836720303981565 test loss:0.005303853657096624
epoch:3 train loss:0.005126343821757473 test loss:0.005252862349152565
epoch:4 train loss:0.0051105242455378175 test loss:0.005215560086071491
epoch:5 train loss:0.0051049498724751174 test loss:0.005193015094846487
epoch:6 train loss:0.0050972960161743686 test loss:0.005174010992050171
epoch:7 train loss:0.00508945070032496 test loss:0.005157759413123131
epoch:8 train loss:0.005082122712337878 test loss:0.005143859423696995
epoch:9 train loss:0.005075385881355032 test loss:0.005131811834871769
epoch:10 train loss:0.005069269987870939 test loss:0.005121266003698111
epoch:11 train loss:0.005063754346338101 test loss:0.005111947655677795
epoch:12 train loss:0.005058795439254027 test loss:0.0051036374643445015
epoch:13 train loss:0.005054342138464563 test loss:0.005096168722957373
epoch:14 train loss:0.005050342413596809 test loss:0.005089408252388239
epoch:15 train loss:0.005046747217420489 test loss:0.0050832550041377544
epoch:16 train loss:0.0050435115190339275 test loss:0.0050776260904967785
epoch:17 train loss:0.005040595249738544 test loss:0.005072456784546375
epoch:18 train loss:0.005037962750066072 test loss:0.0050676921382546425
epoch:19 train loss:0.005035582260461524 test loss:0.0050632888451218605
epoch:20 train loss:0.005033426670706831 test loss:0.005059210117906332
epoch:21 train loss:0.005031472093833145 test loss:0.00505542429164052
epoch:22 train loss:0.0050296964109293185 test loss:0.005051903426647186
epoch:23 train loss:0.005028080631745979 test loss:0.005048623774200678
epoch:24 train loss:0.005026608763728291 test loss:0.005045564845204353
epoch:25 train loss:0.005025265425501857 test loss:0.00504270801320672
epoch:26 train loss:0.005024037745897658 test loss:0.005040035117417574
epoch:27 train loss:0.005022913828724995 test loss:0.0050375331193208694
epoch:28 train loss:0.0050218828255310655 test loss:0.005035187117755413
epoch:29 train loss:0.0050209358960273676 test loss:0.0050329845398664474
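
The loop feeds one time step per call because a recurrent layer carries its hidden state from one call to the next, and in the many-to-one setting only the output after the final step is used as the prediction. The NumPy sketch below illustrates that mechanism. `TinyLstm` and `last_output` are hypothetical names; this is not ReNom's implementation, there is no training, and the weights are random.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyLstm:
    """Hypothetical minimal LSTM cell that keeps its state between calls."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.RandomState(seed)
        # One weight matrix for the four gates (input, forget, candidate, output).
        self.W = rng.normal(0, 0.1, size=(n_in + n_hidden, 4 * n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.n_hidden = n_hidden
        self.h = None
        self.c = None

    def __call__(self, x):
        if self.h is None:  # first step of a sequence: start from zeros
            self.h = np.zeros((x.shape[0], self.n_hidden))
            self.c = np.zeros((x.shape[0], self.n_hidden))
        gates = np.concatenate([x, self.h], axis=1) @ self.W + self.b
        i, f, g, o = np.split(gates, 4, axis=1)
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        return self.h

    def truncate(self):
        """Forget the carried state so the next call starts a new sequence."""
        self.h = None
        self.c = None


def last_output(cell, batch):
    """Feed a (batch, time, features) array step by step; keep the last output."""
    cell.truncate()
    out = None
    for t in range(batch.shape[1]):
        out = cell(batch[:, t, :])
    cell.truncate()
    return out


batch = np.random.RandomState(1).normal(size=(4, 30, 7))
cell = TinyLstm(n_in=7, n_hidden=5)
out = last_output(cell, batch)
print(out.shape)  # (4, 5)
```

Resetting the state between sequences matters: without it, the state left over from one batch would leak into the first step of the next.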

Predict and show some results

We report the root mean squared error (RMSE) and plot the predictions against the actual values.
RMSE is a useful evaluation metric because it expresses the typical size of the prediction error in the same units as the target.
In [7]:
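sequential.truncate()  # start from a clean recurrent state
for t in range(T):
    # Only the output after the last time step is kept as the prediction.
    test_predict = sequential(X_test[:, t, :])
sequential.truncate()
test_predict = np.array(test_predict)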

y_test_raw = undoScaler(y_test.reshape(-1,1), maxlist_label, minlist_label)
test_predict_raw = undoScaler(test_predict.reshape(-1,1), maxlist_label, minlist_label)

print("Root mean squared error:{}".format(np.sqrt(mean_squared_error(y_test_raw, test_predict_raw))))

plt.plot(y_test_raw, label="original")
plt.plot(test_predict_raw, label="test_predict")
plt.legend()
plt.show()
Root mean squared error:16.82746141634212
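
The error above is measured on the restored scale, whereas the losses printed during training are mean squared errors on the scaled targets. As a minimal sketch with made-up numbers, the block below computes RMSE directly from its definition and shows that an RMSE measured on min-max scaled values converts to the original units by multiplying by the range (max minus min). The function name `rmse` and the sample values are assumptions for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values on the original scale.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])
raw_error = rmse(y_true, y_pred)

# The same values after min-max scaling with the range of y_true.
lo, hi = y_true.min(), y_true.max()
scaled_error = rmse((y_true - lo) / (hi - lo), (y_pred - lo) / (hi - lo))

print(raw_error)                 # sqrt(1.25), about 1.118
print(scaled_error * (hi - lo))  # equals raw_error up to rounding
```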