Text Generation Using LSTM: A Step-by-Step Guide
Text generation is a fascinating application of deep learning in natural language processing (NLP). It involves training a model on a given text dataset, which can then generate new, coherent sequences of text based on the patterns it has learned. In this blog, we will build a text generation model using LSTM (Long Short-Term Memory) networks in TensorFlow and Keras. We will go through each step, from tokenizing the text to training the model and generating new text.
What is LSTM?
LSTM is a type of Recurrent Neural Network (RNN) designed to better handle long-term dependencies in sequential data. Unlike standard RNNs, LSTMs use gates to control the flow of information, enabling them to retain information over long sequences, which is critical in tasks such as language modeling, machine translation, and text generation.
Why LSTM for Text Generation?
- Captures Long-Term Dependencies: In text generation, the model needs to learn from the entire sequence of words and make coherent predictions based on the context. LSTM’s ability to capture long-term dependencies makes it ideal for this task.
- Handles Variable Length Sequences: Text data often comes in sequences of varying lengths. LSTMs can handle these variable-length sequences efficiently.
- Prevents Vanishing Gradients: The gating mechanism in LSTMs prevents the vanishing gradient problem that standard RNNs suffer from, allowing them to learn from longer sequences.
Step-by-Step Guide to Text Generation with LSTM
Step 1: Import Libraries
We first import the necessary libraries from TensorFlow and Keras. These include LSTM, Embedding, Tokenizer, and functions for text preprocessing and sequence padding.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Step 2: Prepare the Text Data
For this example, we’ll use a short excerpt from Shakespeare. You can replace this text with any dataset you’d like to train the model on. The text is tokenized, which means each word is mapped to an integer.
# Sample text (e.g., Shakespeare)
text = """Shall I compare thee to a summer's day? Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May, And summer's lease hath all too short a date."""
Step 3: Tokenize the Text
To train our model, we need to convert the text into a numerical format. This is done using Keras’s Tokenizer, which converts the words in the text to sequences of integers. Each unique word is assigned a specific index, and the tokenizer creates a mapping of words to indices.
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
- tokenizer.fit_on_texts(): Creates the word-to-index mapping.
- total_words: Stores the number of unique words in the text (plus one, since index 0 is reserved for padding).
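If you’d like to sanity-check the mapping, you can print a few entries of tokenizer.word_index; the exact indices depend on word frequencies in your text, so treat the output as illustrative.
# Inspect the word-to-index mapping (indices depend on your text)
print(total_words)
print(list(tokenizer.word_index.items())[:5])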
Step 4: Generate Input Sequences
We now need to create sequences of words for training the model. For this, we’ll generate n-gram sequences from the text. The sequence is progressively increased by one word at a time to form the training data. For example, the sentence “Shall I compare thee” will generate these sequences:
- “Shall I”
- “Shall I compare”
- “Shall I compare thee”
# Create sequences for text generation
input_sequences = []
for line in text.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
Each sentence is split into sequences of increasing length, creating the n-gram sequences necessary for training.
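If you’re curious, printing a few of the generated sequences makes the n-gram idea concrete (the integer values depend on the tokenizer’s word index):
# Peek at the first few n-gram sequences
for seq in input_sequences[:3]:
    print(seq)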
Step 5: Pad Sequences
The sequences generated in the previous step are of varying lengths. Since the model requires all input sequences to be of the same length, we pad them to a uniform size using pad_sequences(). The padding is added at the beginning of each sequence (pre-padding), so the real words always sit at the end of the input, right next to the word the model has to predict.
# Pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
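A quick check confirms that every sequence now has the same length, with zeros added at the front of the shorter ones (the exact shape depends on your text):
# Verify the padded sequences
print(max_sequence_len)
print(input_sequences.shape)  # (number of n-gram sequences, max_sequence_len)
print(input_sequences[0])     # shorter sequences start with zeros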
Step 6: Split Input and Output
Next, we split each sequence into input (X) and output (y). The input consists of all words in the sequence except the last one, and the output is the last word in the sequence. The output (y) is one-hot encoded to allow the model to predict the next word from a vocabulary of all possible words.
# Split sequences into input (X) and output (y)
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
- X: Contains the input sequences with the last word removed.
- y: Contains the output word, which is one-hot encoded.
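A shape check makes the split concrete: X has one column fewer than the padded sequences, and y has one column per word in the vocabulary.
# Confirm the shapes of the inputs and one-hot targets
print(X.shape)  # (number of sequences, max_sequence_len - 1)
print(y.shape)  # (number of sequences, total_words)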
Step 7: Build the LSTM Model
Now, we build the LSTM model using Keras’s Sequential API. The model consists of the following layers:
- Embedding Layer: Converts the input word indices into dense vectors of fixed size.
- LSTM Layer: Processes the input sequences and learns the temporal relationships between words.
- Dense Layer: Outputs a probability distribution over the vocabulary using a softmax activation function, which predicts the next word in the sequence.
# Build LSTM model for text generation
model_textgen_lstm = Sequential()
model_textgen_lstm.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model_textgen_lstm.add(LSTM(100))
model_textgen_lstm.add(Dense(total_words, activation='softmax'))
- Embedding Layer: Takes the vocabulary size (total_words) and generates word embeddings of size 64.
- LSTM Layer: Contains 100 units, which capture long-term dependencies in the text.
- Dense Layer: Outputs a softmax probability over the total words in the vocabulary.
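You can inspect the resulting architecture with model.summary(); the exact parameter counts depend on total_words and max_sequence_len for your dataset.
# Print the layers and parameter counts
model_textgen_lstm.summary()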
Step 8: Compile and Train the Model
We compile the model using the Adam optimizer and categorical crossentropy as the loss function. Then, we train the model for 100 epochs.
# Compile model
model_textgen_lstm.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
model_textgen_lstm.fit(X, y, epochs=100, verbose=1)
- Optimizer: Adam is a popular choice for its adaptive learning rate.
- Loss: Categorical crossentropy is used as we are dealing with a multi-class classification problem (predicting the next word from the vocabulary).
Step 9: Generate New Text
Once the model is trained, we can use it to generate new text. The generate_text() function takes a seed text and generates a specified number of new words by predicting the next word repeatedly, updating the seed text with each new prediction.
# Generate new text
def generate_text(seed_text, next_words, max_sequence_len):
    for _ in range(next_words):
        # Convert the current seed text into a padded sequence of word indices
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Pick the most likely next word (argmax over the softmax output)
        predicted = np.argmax(model_textgen_lstm.predict(token_list), axis=-1)[0]
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
Example: Generating Text
Let’s generate 5 new words using the seed text "Shall I compare thee".
# Example text generation
print(generate_text("Shall I compare thee", 5, max_sequence_len))
The model will predict and append five new words to the seed text. The generated text might look like:
Shall I compare thee to the summer's day and...
Key Considerations for Text Generation with LSTM:
- Data Size: This example uses a very short piece of text, but larger datasets will allow the model to learn better and generate more coherent text.
- Epochs: Training for more epochs can help the model learn more patterns from the text but may also lead to overfitting.
- Temperature: You can add randomness to the generated text by sampling from the softmax output (instead of picking the most likely word) using a technique called temperature sampling; a minimal sketch follows below.
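As a rough sketch of temperature sampling (the helper below is illustrative and not part of the code above): rescale the predicted probabilities by a temperature value, re-normalize them, and sample from the result instead of taking the argmax. Lower temperatures stay close to the most likely word; higher temperatures produce more varied text.
# Illustrative temperature sampling helper (not used in the code above)
def sample_with_temperature(probs, temperature=1.0):
    probs = np.asarray(probs).astype('float64')
    logits = np.log(probs + 1e-9) / temperature   # rescale log-probabilities
    exp_logits = np.exp(logits)
    probs = exp_logits / np.sum(exp_logits)       # re-normalize into a distribution
    return np.random.choice(len(probs), p=probs)  # sample an index instead of argmax

# Example usage inside generate_text(), replacing the argmax line:
# predicted = sample_with_temperature(model_textgen_lstm.predict(token_list)[0], temperature=0.8)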
Conclusion
In this blog, we’ve built a simple text generation model using LSTM with TensorFlow and Keras. By training the model on a sample of text and generating new sequences, we’ve demonstrated how LSTMs can capture the structure of the language and generate coherent text. You can experiment with larger datasets, more epochs, or different model architectures to improve the quality of the generated text.
Stay tuned for more!
I am always happy to connect with my followers and readers on LinkedIn. If you have any questions or just want to say hello, please don’t hesitate to reach out.
https://www.linkedin.com/in/sharmasaravanan/
Happy learning!
Adios, me gusta!! 🤗🤗