[PyTorch] 5 Surname-Generating RNN in Practice: Generating Names with a Language Model
Published: 2019-05-23


Generating Names with a Character-Level RNN

This post is a detailed annotation of the official NLP From Scratch tutorial (part 2 of 3).

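1. Preparing the data

Data preparation differs from the previous post in one place: the vocabulary gets one extra slot for the end-of-sequence (EOS) marker.

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker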

The code for this part is as follows:

import unicodedata
import string
import glob
import os

all_letters = string.ascii_letters + " .,;'-"    # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'-
n_letters = len(all_letters) + 1        # 59
category_lines = {}
all_categories = []

def unicodeToAscii(s):
    Ascii = []
    for c in unicodedata.normalize('NFD', s):
        if unicodedata.category(c) != 'Mn' and c in all_letters:
            Ascii.append(c)
    return ''.join(Ascii)

def findFiles(path):
    return glob.glob(path)

def readLines(filename):
    lines = open(filename, 'r', encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

path = '... your path\\data\\'

if __name__ == '__main__':
    for filename in findFiles(path + 'names\\*.txt'):
        category = os.path.splitext(os.path.basename(filename))[0]
        all_categories.append(category)
        lines = readLines(filename)
        category_lines[category] = lines

    n_categories = len(all_categories)
    print('# categories:', n_categories, all_categories)
    print(unicodeToAscii("O'Néàl"))

Output:

# categories: 18 ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese']
O'Neal
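To see why "O'Néàl" comes out as "O'Neal", it may help to look at what NFD normalization does to an accented letter. The sketch below uses only the standard library; `unicode_to_ascii` is a compact restatement of `unicodeToAscii` above, written here so the snippet is self-contained.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'-"

def unicode_to_ascii(s):
    # NFD splits an accented letter into base letter + combining mark;
    # category 'Mn' (nonspacing mark) identifies the combining marks to drop.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

decomposed = unicodedata.normalize('NFD', 'é')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

print(unicode_to_ascii("O'Néàl"))  # O'Neal
```

The `c in all_letters` check also discards any character outside the 58-symbol vocabulary, so every letter that reaches the network has a valid index.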

2. Building the network

This part defines:

  • class Net(nn.Module)
  • def randomChoice(l): # pick a random item from a list
  • def randomTrainPair(): # pick a random category and a random line from that category

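About nn.Dropout(0.1): adding dropout to a network helps reduce overfitting. During training, each element passing through this layer is zeroed independently with probability 10% on every forward pass, and the surviving elements are scaled by 1/(1 - 0.1). In evaluation mode the layer passes its input through unchanged.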
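A minimal sketch of this behaviour, assuming PyTorch is installed. The tensor of ones and the seed are illustrative choices, not part of the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)          # illustrative seed, only for repeatability
drop = nn.Dropout(p=0.1)
x = torch.ones(1, 1000)

drop.train()                  # training mode: elements are dropped at random
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()   # roughly 0.1
kept_value = y_train[y_train != 0][0].item()           # survivors become 1 / (1 - 0.1)

drop.eval()                   # evaluation mode: dropout is the identity
y_eval = drop(x)

print(zero_fraction, kept_value, torch.equal(y_eval, x))
```

Note that torch.no_grad() only disables gradient tracking; it does not turn dropout off. Dropout is disabled by calling .eval() on the module (or on the whole model).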

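3. Preparing for training

At each time step (that is, for each letter of the training word), the network's inputs are (category, current letter, hidden state) and its outputs are (next letter, next hidden state). So for each training example we need the category, a sequence of input letters, and a sequence of output/target letters.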
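As a small illustration of how one name turns into (input letter, target index) pairs, here is a pure-Python sketch. The helper `training_pairs` and the sample name "Kim" are hypothetical; the vocabulary matches the one defined in section 1.

```python
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1   # 58 symbols plus one EOS slot
EOS = n_letters - 1                # index 58

def training_pairs(line):
    # The input at step i is letter i; the target is the index of letter i + 1,
    # and the target for the last letter is EOS.
    targets = [all_letters.find(c) for c in line[1:]] + [EOS]
    return list(zip(line, targets))

pairs = training_pairs("Kim")
print(pairs)  # [('K', 8), ('i', 12), ('m', 58)]
```

The input and target sequences have the same length, which is why the training loop can index both with the same step counter.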

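About find(sub, beg=0, end=len(string)): this string method checks whether the string contains the substring sub. If beg (start) and end are given, only that range is searched. It returns the index at which the substring first starts, or -1 if it is not found.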
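A few concrete calls, using the same `all_letters` string as in section 1. The last lines show why the -1 return value deserves attention: used as a list or tensor index, -1 silently selects the last position, which in this vocabulary is the EOS slot.

```python
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1

idx_a = all_letters.find('a')        # 0
idx_Z = all_letters.find('Z')        # 51
idx_missing = all_letters.find('é')  # -1: not in the vocabulary
idx_ranged = all_letters.find('b', 5)  # -1: 'b' sits at index 1, before beg=5

# A one-hot row indexed with -1 marks the LAST slot instead of raising an error.
row = [0] * n_letters
row[idx_missing] = 1
print(idx_a, idx_Z, idx_missing, idx_ranged, row.index(1))  # 0 51 -1 -1 58
```

Because unicodeToAscii already drops every character that is not in all_letters, lines read through readLines should never hit the -1 case.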

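This part defines:

  • def categortTensor(category): # one-hot vector for the category
  • def inputTensor(line): # one-hot matrix of the input, first letter through last letter (EOS not included)
  • def targetTensor(line): # LongTensor of target indices, second letter through the end (EOS)
  • def randomTrainingExample(): # draw a random (category, line) pair and convert it into the required (category, input, target) tensors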

4. Training the network

.unsqueeze_(-1): adds a trailing dimension of size 1, in place

Training log:
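A short sketch of what this call does to the target tensor, assuming PyTorch is installed. The index values and the uniform log-probabilities are made up for illustration.

```python
import math
import torch
import torch.nn as nn

target = torch.LongTensor([8, 12, 58])   # illustrative letter indices, ending in EOS
print(target.shape)                      # torch.Size([3])

target.unsqueeze_(-1)                    # trailing underscore: modifies the tensor in place
print(target.shape)                      # torch.Size([3, 1])
print(target[0].shape)                   # torch.Size([1])

# Each target[i] now has shape [1], matching a [1, 59] network output for NLLLoss.
log_probs = torch.log_softmax(torch.zeros(1, 59), dim=1)   # uniform distribution
loss = nn.NLLLoss()(log_probs, target[0])
print(loss.item())                       # about ln(59)
```

Without the extra dimension, target[i] would be a 0-dimensional tensor, which does not line up with the batch dimension of the [1, 59] output.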

0m 17s (5000 5.0%) 2.9710
0m 34s (10000 10.0%) 1.9783
0m 51s (15000 15.0%) 2.7877
1m 8s (20000 20.0%) 2.6203
1m 24s (25000 25.0%) 2.9135
1m 41s (30000 30.0%) 2.5772
1m 58s (35000 35.0%) 2.4687
2m 14s (40000 40.0%) 2.5871
2m 31s (45000 45.0%) 2.1578
2m 48s (50000 50.0%) 2.1146
3m 6s (55000 55.00000000000001%) 2.2784
3m 22s (60000 60.0%) 1.7463
3m 39s (65000 65.0%) 2.5518
3m 56s (70000 70.0%) 1.3736
4m 13s (75000 75.0%) 2.3050
4m 29s (80000 80.0%) 2.4178
4m 46s (85000 85.0%) 2.3863
5m 3s (90000 90.0%) 3.1919
5m 20s (95000 95.0%) 2.6269
5m 37s (100000 100.0%) 2.3016
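One way to read these numbers: each printed value is the average per-letter negative log-likelihood of the single example trained at that iteration, so it is noisy. As a rough reference point, a model that guesses uniformly over the 59 output symbols would score ln(59), and exp(loss) can be read as a per-letter perplexity. The calculation below is an interpretation aid and not part of the original training code.

```python
import math

n_letters = 59
uniform_loss = math.log(n_letters)     # loss of a uniform guess, about 4.08

final_loss = 2.3016                    # last value in the log above
perplexity = math.exp(final_loss)      # about 10 equally likely choices per letter

print(round(uniform_loss, 4), round(perplexity, 2))
```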

How the loss changes over training:

[Figure: loss curve]

5. Testing

print(sample('English', 'Y'))
print(sample('English', 'S'))
print(sample('English', 'C'))
Yande
Santeng
Chambennt

print(sample('Chinese', 'Y'))
print(sample('Chinese', 'S'))
print(sample('Chinese', 'C'))
Yue
Sha
Cha

print(sample('Korean', 'Y'))
print(sample('Korean', 'S'))
print(sample('Korean', 'C'))
You
Sho
Chun

print(sample('Russian', 'Y'))
print(sample('Russian', 'S'))
print(sample('Russian', 'C'))
Yanhov
Shimhon
Chinhinh
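The sample function in the full code picks the single most likely next letter at each step (topk(1)) and stops at EOS or after max_length letters. The toy sketch below mirrors only that control flow: the lookup table `toy_next` and the function `greedy_sample` are hypothetical stand-ins for the trained network, and the table ignores the category and the hidden state.

```python
EOS = "<EOS>"

# Hypothetical next-letter probabilities; a real model would compute these
# from the category, the current letter, and the hidden state.
toy_next = {
    "Y": {"u": 0.6, "a": 0.3, EOS: 0.1},
    "u": {"e": 0.7, "n": 0.2, EOS: 0.1},
    "e": {EOS: 0.8, "n": 0.2},
    "a": {"n": 0.5, EOS: 0.5},
    "n": {EOS: 1.0},
}

def greedy_sample(start_letter, max_length=20):
    name = start_letter
    current = start_letter
    for _ in range(max_length):
        probs = toy_next[current]
        best = max(probs, key=probs.get)   # greedy choice, like topk(1)
        if best == EOS:                    # stop when the model predicts end-of-name
            break
        name += best
        current = best                     # feed the chosen letter back in
    return name

print(greedy_sample("Y"))  # Yue
```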

6. Full code

import unicodedata
import string
import glob
import os
import random
import time
import math

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

all_letters = string.ascii_letters + " .,;'-"    # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'-
n_letters = len(all_letters) + 1        # 59
category_lines = {}
all_categories = []

def unicodeToAscii(s):
    Ascii = []
    for c in unicodedata.normalize('NFD', s):
        if unicodedata.category(c) != 'Mn' and c in all_letters:
            Ascii.append(c)
    return ''.join(Ascii)

def findFiles(path):
    return glob.glob(path)

def readLines(filename):
    lines = open(filename, 'r', encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Net, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

def randomChoice(l):  # pick a random item from a list
    return l[random.randint(0, len(l) - 1)]

def randomTrainPair():  # pick a random category and a random line from that category
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

def categortTensor(category):  # one-hot vector for the category
    index = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][index] = 1
    return tensor

def inputTensor(line):  # one-hot matrix of the input, first letter through last letter (EOS not included)
    tensor = torch.zeros(len(line), 1, n_letters)
    for i in range(len(line)):
        letter = line[i]
        tensor[i][0][all_letters.find(letter)] = 1
    return tensor

def targetTensor(line):  # LongTensor of target indices, second letter through the end (EOS)
    letter_indexes = [all_letters.find(line[i]) for i in range(1, len(line))]
    letter_indexes.append(n_letters - 1)  # EOS
    return torch.LongTensor(letter_indexes)

def randomTrainingExample():  # draw a random (category, line) pair and convert it into (category, input, target) tensors
    category, line = randomTrainPair()
    category_tensor = categortTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = model.initHidden()
    model.zero_grad()
    loss = 0

    for i in range(input_line_tensor.size()[0]):
        output, hidden = model(category_tensor, input_line_tensor[i], hidden)  # torch.Size([1, 18]) torch.Size([1, 59]) torch.Size([1, 128])
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in model.parameters():
        p.data.add_(-learning_rate * p.grad.data)

    return output, loss.item() / input_line_tensor.size()[0]

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def sample(category, start_letter='A'):
    with torch.no_grad():
        categort_tensor = categortTensor(category)
        input = inputTensor(start_letter)
        hidden = model.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = model(categort_tensor, input[0], hidden)
            topv, topi = output.data.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

path = '... your path\\data\\'

if __name__ == '__main__':
    for filename in findFiles(path + 'names\\*.txt'):
        category = os.path.splitext(os.path.basename(filename))[0]
        all_categories.append(category)
        lines = readLines(filename)
        category_lines[category] = lines

    n_categories = len(all_categories)

    model = Net(n_letters, 128, n_letters)

    # Training
    # criterion = nn.NLLLoss()
    # learning_rate = 0.005
    #
    # n_iters = 100000
    # print_every = 5000
    # plot_every = 50
    # all_losses = []
    # total_loss = 0  # Reset every plot_every iters
    #
    # start = time.time()
    #
    # for iter in range(1, 1 + n_iters):
    #     output, loss = train(*randomTrainingExample())
    #     total_loss += loss
    #
    #     if iter % print_every == 0:
    #         print('{} ({} {}%) {:.4f}'.format(timeSince(start), iter, iter / n_iters * 100, loss))
    #     if iter % plot_every == 0:
    #         all_losses.append(total_loss / plot_every)
    #         total_loss = 0
    #
    # torch.save(model.state_dict(), '... your path\\model_2.pth')
    # plt.figure()
    # plt.plot(all_losses)
    # plt.show()

    # Testing
    model.load_state_dict(torch.load('... your path\\model_2.pth'))
    max_length = 20

    print(sample('Russian', 'Y'))
    print(sample('Russian', 'W'))
    print(sample('Russian', 'L'))

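Summary

The previous post covered:

  1. Classification
  2. The letters of each word are fed in one at a time, and only the final output is used

This post covers:

  1. Prediction and generation
  2. An additional category input
  3. The output at every letter is used: it is compared against the next letter to compute the loss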

Reposted from: http://ywtrn.baihongyu.com/
