以竞赛的方式学习LSTM

一、背景介绍

本次的数据是呼吸机压力测试，在这个比赛中，参与者被给予许多呼吸的时间序列，并且将学习在给定控制输入的时间序列的情况下预测呼吸期间呼吸回路中的气道压力。我们的目标是预测：压力-呼吸回路中测得的气道压力，单位为cmH 2 O，因此这是一个回归问题，由于是时间序列，因此我们可以尝试使用RNN或LSTM等模型。

本次案例学习分为两个步骤，第一部分：以LSTM算法，简单的建立baseline模型，作为我们的基线，这是本次的学习重点；第二部分：尝试修改基线，提升模型的效果，具体的做法可以尝试更换其他模型、尝试更多的特征组合，这是下一篇文章的内容。

二、数据探索

2.1 训练数据和测试数据

train数据有440MB，test数据有220MB，以CSV的文件格式保存。二者的数据字段如下所示：

id：整个文件中全局唯一的时间点标识符。

breath_id：呼吸的全局唯一时间步长。

R：肺部属性，指示气道受限程度（cmH 2 O/L/S）。物理上，这是每单位流量变化的压力变化（每单位时间的空气体积）。直觉上，人们可以想象用吸管吹气球。我们可以通过改变吸管的直径来改变R，R越大，越难吹。

C：指示肺顺应性的肺属性（单位：mL/cmH 2 O）。从物理上讲，这是每改变一次压力所引起的体积变化。直观地，可以想象相同的气球示例。我们可以通过改变气球乳胶的厚度来改变C，C越高，乳胶越薄，越容易吹。

time_step：实际时间戳。

u_id：吸气电磁阀的控制输入。范围从0到100。

u_out：探测电磁阀的控制输入。0或1。

pressure：压力-呼吸回路中测得的气道压力，单位为cmH 2 O。这是我们的预测目标，两个数据集只有这里存在差异。

2.2 查看数据

首先导入相关库，查看数据：

import pandas as pd
import numpy as np
import gc
import time
import matplotlib.pyplot as plt
import seaborn as sns

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
sub = pd.read_csv('../input/sample_submission.csv')

#查看数据的前五行
train_df.head()
#查看数据的缺失值
train_df.isnull().sum()

结果1-前五行：

结果2-缺失值:

可以看到，我们的数据基本是float或int类型，尽管有个别属于类别特征，但也转化成了数值。而且我们的数据很干净，没有缺失值，这为我们省去了填充缺失值的步骤。此时我们进行简单的数据可视化。

2.3 数据可视化

我们先查看两个比较重要的肺部属性分布。

fig = plt.figure(figsize = (13, 8))
rc = ['R', 'C']
for i in rc:
    plt.subplot(2, 2, rc.index(i)+1)
    plt.title(i, y = 1.2, size = 25, fontname = 'monospace', color = 'black')
    a = sns.countplot(x = i, data = train_df, palette = ['#488a99', '#dbae58', '#4b585c'])
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname = 'monospace', size = 12)
    plt.yticks([])
    for j in ['right', 'top']:
        a.spines[j].set_visible(False)
    for j in ['bottom', 'left']:    
        a.spines[j].set_linewidth(1.2)
        
    summ = 0
    for p in a.patches:
        summ += p.get_height()

    for p in a.patches:
        height = p.get_height()
        a.annotate(f'{height}', (p.get_x() + p.get_width() / 2, p.get_height()), 
                   ha = 'center', va = 'center', 
                   size = 13,
                   xytext = (1, -15), 
                   textcoords = 'offset points',
                   fontname = 'monospace', color = 'white')
        a.annotate(f'{round((height/summ) * 100, 1)}%', (p.get_x() + p.get_width() / 2, p.get_height()), 
                   ha = 'center', va = 'center', 
                   size = 15,
                   xytext = (1, 13), 
                   textcoords = 'offset points',
                   fontname = 'monospace', color = 'black')   
        
for i in rc:
    plt.subplot(2, 2, rc.index(i)+3)
    a = sns.countplot(x = i, data = test_df, palette = ['#488a99', '#dbae58', '#4b585c'])
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname = 'monospace', size = 12)
    plt.yticks([])
    for j in ['right', 'top']:
        a.spines[j].set_visible(False)
    for j in ['bottom', 'left']:    
        a.spines[j].set_linewidth(1.2)
        
    summ = 0
    for p in a.patches:
        summ += p.get_height()

    for p in a.patches:
        height = p.get_height()
        a.annotate(f'{height}', (p.get_x() + p.get_width() / 2, p.get_height()), 
                   ha = 'center', va = 'center', 
                   size = 13,
                   xytext = (1, -15), 
                   textcoords = 'offset points',
                   fontname = 'monospace', color = 'white')
        a.annotate(f'{round((height/summ) * 100, 1)}%', (p.get_x() + p.get_width() / 2, p.get_height()), 
                   ha = 'center', va = 'center', 
                   size = 15,
                   xytext = (1, 13), 
                   textcoords = 'offset points',
                   fontname = 'monospace', color = 'black')
        
plt.figtext(0.15, 1.1, 'Distribution of lung attributes (R/C)', fontname = 'monospace', size = 30, color = 'black')
plt.figtext(1.03, 0.15, 'TEST', fontname = 'monospace', size = 25, color = 'black', rotation = 90)
plt.figtext(1.03, 0.7, 'TRAIN', fontname = 'monospace', size = 25, color = 'black', rotation = 90)
        
fig.tight_layout(h_pad = 10)
fig.savefig('r_c.png')
plt.show()

结果1- R和C在两个数据内部的可视化:

有细微的差距，但是对我们来说影响不大，这对于我们来说是好事，说明符合机器学习或深度学习的基本要求。从上图我们也可以看到，二者的取值只有3个数值，因此我们查看一下在二者不同取值情况下的密度图。

fig = plt.figure(figsize = (15, 12))
plot = 1
for i in range(3):
    rr = r[i]
    for k in range(3):
        cc = c[k]
        plt.subplot(3, 3, plot)
        plt.title(f'R = {rr} | C = {cc}', 
        fontname = 'monospace', size = 15, color = 'black')
        a = sns.kdeplot(train_df.query(
            'time_step < 0.000001 & 
            u_in < 0.000001 & 
            R == @rr & C == @cc')['pressure'], 
            color = '#488a99', shade = True, 
            alpha = 1, linewidth = 1.5, edgecolor = 'black')
        plt.ylabel('')
        plt.xlabel('')
        plt.xticks(size = 12, fontname = 'monospace')
        plt.yticks([])

        for j in ['right', 'top']:
            a.spines[j].set_visible(False)
        for j in ['bottom', 'left']:    
            a.spines[j].set_linewidth(1.2)
            
        plot += 1

y = 1.27
for i in range(3):
    rr = r[i]
    y -= 0.333
    x = -0.315
    for k in range(3):
        cc = c[k]
        x += 0.333
        plt.figtext(x, y, 
        f'Min: {round(train.query("time_step < 0.000001 & u_in < 0.000001 & R == @rr & C == @cc")["pressure"].min(),2)}', 
        fontname = 'monospace', color = 'black')
        
        plt.figtext(x, y-0.02, 
        f'Max: {round(train.query(
        "time_step < 0.000001 & u_in < 0.000001 & R == @rr & C == @cc")["pressure"].max(),2)}', 
        fontname = 'monospace')
        plt.figtext(x, y-0.04, 
        f'Mean: {round(train.query(
        "time_step < 0.000001 & u_in < 0.000001 & R == @rr & C == @cc")["pressure"].mean(),2)}', 
        fontname = 'monospace', color = 'black')
        plt.figtext(x, y-0.06, 
        f'Median: {round(train.query(
        "time_step < 0.000001 & u_in < 0.000001 & R == @rr & C == @cc")["pressure"].median(),2)}', 
        fontname = 'monospace', color = 'black')
        
plt.figtext(0.01, 1.08, 
      'Distribution of pressure depending on lung attributes', 
       fontname = 'monospace', size = 30, color = 'black')
        
fig.tight_layout(h_pad = 3)
plt.show()

结果：

可以看到，在二者不同取值的情况下，存在双峰形态，一般而言，我们期待正态分布，双峰形态说明数据交互之下，隐含了许多信息。

这是一个时间序列问题，考虑到我们需要使用LSTM作为基线模型，因此我们需要确定”时间步长”。观察数据，”breath_id”可能是需要考虑的时间步长，究竟多长？我们查看一下。

unique_, unique_cont = np.unique(
          train['breath_id'], 
          return_counts=True)
print('唯一值个数：', unique_)
print('每个唯一值的数量大小：', unique_cont)
print('每个唯一值数量大小的唯一值： ', np.unique(unique_cont))

结果：

可以看到，唯一值有125749个ID，并且每个ID都是80个长度，因此在后续的处理中，时间步长=80，这是我们需要转化的结果。

三、torch.LSTM

在使用LSTM之前，先来说明一下LSTM的相关参数。

lstm = nn.LSTM(
  input_size,
  hidden_size,
  num_layers,
  bias,
  batch_first,
  drop_out,
  bidirectional,
  proj_size
)

"""
# 1、参数介绍：
input_size: int, 输入维度大小。
hidden_size: int, lstm循环单元的隐藏层大小，一般也是其输出层大小。
num_layers: int, 一个循环单元的层数或深度，可以理解为DNN中的隐藏层数量。
bias: bool, 偏置项，默认为True。
batch_first: bool, 是否将批量大小放在第一个维度，默认为False，
它要求我们根据这个参数，更改输入数据的维度。
False-> (time_step, batch_size, input_size)
True->  (batch_size, time_step, input_size)
drop_out: float，循环神经单元内部的失活率，用来防止过拟合，
默认为0.0，如果不为0.0，将会在每个lstm循环单元的最后一层使用。
bidirectional：bool，默认为False，表示单向循环=1，
如果为True，则为双向=2，在内部会将数据从左往后、从后往前，
按照相同的步长输入，也因为在输出时维度会发生变化。
proj_size：基本很少使用。默认为0。
"""

"""
# 2、输入数据和输出数据：
outs, (h_s, c_s) = lstm(input,(h_0, c_0))
1)、input: 
正常情况下，维度->(time_step, input_size)，
但由于我们是按照批量大小输入，因此维度默认情况下：
(time_step, batch_size, input_size)。
由于一些人习惯问题，喜欢将batch_size放在前面，
因此会有：(batch_size, time_step, input_size)，
此时需要将前面说的batch_size=True.

2)、h_0和c_0
二者的维度是一样的，h_0表示初始的短期记忆值，c_0表示初始的长期记忆值。
默认维度：(D*num_layers, hidden_size)。
如果存在批量大小，(D*num_layers, batch_size, hidden_size)
D:表示单向或双向，如果是单向，则为1，否则为2。

3)、输出数据维度
outs: 表示每个时间步长，最后一层的输出结果。以下均是存在批量大小的情况。
维度：(time_step, batch_size, D*hidden_size) 或者 
(batch_size, time_step, D*hidden_size)
h_s和c_s的维度是类似的：
(D*num_layers, batch_size, hidden_size)
"""

四、LSTM baseline

4.1 数据处理

在这里，我们需要将数据处理成模型需要的维度，并且按照批量大小输入，我们还需要对数据进行额外的缩放处理，让所有特征保持在一个尺度。由于原始数据量太大，有几百万行，因此我们根据特征”breath_id”选择部分行。

#固定随机种子，尽量保持结果一致。
import random
import os

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything()

#由于数据量很大，因此我们考虑对部分数据进行训练。
#获取2500个id值，可能会出现重复，因此实际上不会有2500个id，然后去重，选择相应的数据即可。
luck_breath_id = np.random.randint(low=1, high=len(unique_), size=2500)
luck_id = list(set(list(luck_breath_id)))  #获取非重复值的id，并且转化为列表
train = train[train['breath_id'].isin(luck_id)].reset_index(drop=True)
#查看随机选择的特征
print(train['breath_id'].nunique())
#输出结果：1518，需要注意每个id都有80行时间序列，
#因此实际行会有：1518*80行。

下面我们提取需要的特征，剔除不需要的特征，如果后续我们重新创建了新的特征，下面的这一部分代码可以不用修改。

from sklearn.preprocessing import RobustScaler

excul = ['pressure', 'id', 'breath_id'] #不需要训练的特征
feature = [col for col in train.columns.to_list() if col not in excul]
print('当前特征: ', feature)

#数值缩放
RS = RobustScaler()
train.loc[:, feature] = RS.fit_transform(train[feature])
test.loc[:, feature] = RS.transform(test[feature])

#只含有训练特征的相关数据
feature_df = train.loc[:, feature].values

#转化维度，(batch_size, 80, 5)
X_all = feature_df.reshape(-1, 80, feature_df.shape[-1])
y_all = train.pressure.values.reshape(-1, 80)

input_dim = X_all.shape[-1]
print('数据大小: ',len(X_all))
print('输入维度：', input_dim)

结果：

可以看到，目前我们的特征只有5个，因此输入维度是5，而且数据量共有1518个”breath_id”。

4.2 模型建立

模型结果说明：使用双向LSTM，输出结果out(不使用两个记忆单元值)，喂入到后续的MLP结果中，得到对应的结果。

class Model(nn.Module):
    
    def __init__(self, input_size):
        
        hidden = [400, 300, 200, 100]
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden[0],
                             batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden[0], hidden[1],
                             batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden[1], hidden[2],
                             batch_first=True, bidirectional=True)
        self.lstm4 = nn.LSTM(2 * hidden[2], hidden[3],
                             batch_first=True, bidirectional=True)
        
        #由于使用双向LSTM，会将左右两个方向的输出维度拼接
        #因此需要 2*hidden[3], 传入的维度(batch_size, time_step, 2*hidden[3])
        self.fc1 = nn.Linear(2 * hidden[3], 50)
        self.selu = nn.SELU()
        self.fc2 = nn.Linear(50, 1)
        self._reinitialize()

    def _reinitialize(self):
        
        #设置相关权重的初始值。也可以不设置。
        for name, p in self.named_parameters():
            
            if 'lstm' in name:
                if 'weight_ih' in name:
                    nn.init.xavier_uniform_(p.data)
                elif 'weight_hh' in name:
                    nn.init.orthogonal_(p.data)
                elif 'bias_ih' in name:
                    p.data.fill_(0)
                    # Set forget-gate bias to 1
                    n = p.size(0)
                    p.data[(n // 4):(n // 2)].fill_(1)
                elif 'bias_hh' in name:
                    p.data.fill_(0)
            elif 'fc' in name:
                if 'weight' in name:
                    nn.init.xavier_uniform_(p.data)
                elif 'bias' in name:
                    p.data.fill_(0)

    def forward(self, x):
        #x输入维度：(batch_size, time_step=80, input_size=5)
        #下面第一行的lstm可以自行调整h_0,c_0，在这里我们取默认值。
        out, _ = self.lstm1(x) #不使用两个记忆单元的数据
        out, _ = self.lstm2(out)
        out, _ = self.lstm3(out)
        out, _ = self.lstm4(out)
        out = self.fc1(out)
        out = self.selu(out)
        out = self.fc2(out)
        #out最后的维度(batch_size, 80, 1)

        return out

4.3 数据加载和训练、预测

我们利用torch自带的数据加载处理数据的输入。

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        if y is None: #真实预测没有y
            y = np.zeros(len(X), dtype=np.float32)

        self.X = X.astype(np.float32)
        self.y = y.astype(np.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

建立模型评估时的验证代码，封装在一个函数中，减少代码运行量。

import time

#切换使用GPU运行，让模型速度更快。
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = torch.nn.L1Loss() #绝对损失器

#模型评估函数
def evaluate(model, loader_val):
    tb = time.time()
    was_training = model.training
    model.eval() #要求模型处于验证状态。
  
    loss_sum = 0 #初始化验证损失值
    n_sum = 0 #记录样本量
    y_pred_all = [] #验证集的预测列表

    for ibatch, (x, y) in enumerate(loader_val):
        
        n = y.size(0)
        x = x.to(device) #将数据传输到GPU
        y = y.to(device)

        with torch.no_grad():
            #原本输出(batch_size, 80, 1)，使用squeeze(-1)，将最后一维的1压缩。
            #y_pred:(batch_size, 80)
            y_pred = model(x).squeeze(-1)

        loss = criterion(y_pred, y)

        n_sum += n
        loss_sum += n*loss.item() #loss是一个标量，使用item提取。
        #将数据从GPU传到cpu
        y_pred_all.append(y_pred.cpu().detach().numpy())

    loss_val = loss_sum / n_sum

    model.train(was_training)

    d = {'loss': loss_val,
         'time': time.time() - tb,
         'y_pred': np.concatenate(y_pred_all, axis=0)}

    return d

下面开始进入我们的训练代码，先设定好我们需要的常量。

#开始训练

from torch.optim.lr_scheduler import ReduceLROnPlateau #学习率调度器
from sklearn.model_selection import KFold #数据拆分

nfold = 5 #5折交叉
kfold = KFold(n_splits=nfold, shuffle=True, random_state=42)
epochs = 200 #由于我们数据量减少，此处可以设大一点。 
lr = 1e-3  #学习率
batch_size = 512 
max_grad_norm = 1000  #梯度裁剪
log = {}  #保存训练信息

for ifold, (idx_train, idx_val) in enumerate(kfold.split(X_all)):
    
    tb = time.time()
    model = Model(input_dim)
    model.to(device)
    model.train()
    #优化器
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    #学习率调度器
    scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=10)
    #得到实际的训练集和验证集
    X_train = X_all[idx_train]
    y_train = y_all[idx_train]
    X_val = X_all[idx_val]
    y_val = y_all[idx_val]
    
    #使用torch加载数据
    dataset_train = Dataset(X_train, y_train)
    dataset_val = Dataset(X_val, y_val)
    loader_train = torch.utils.data.DataLoader(dataset_train, shuffle=True,
                         batch_size=batch_size, drop_last=True)
    loader_val = torch.utils.data.DataLoader(dataset_val, shuffle=False,
                         batch_size=batch_size, drop_last=False)

    losses_train = []  #训练集损失
    losses_val = []  #验证集损失
    lrs = [] #学习率变化
    time_val = 0   
    best_score = np.inf
   
    print(f'====folds:[{ifold+1} / {nfold}], Start training and validation===='