Introduction
Have you ever wondered how the characters of your favorite web series lived after the series ended? If so, this blog will help you build a script generator that can write a script for a new episode. Our model will be trained on the scripts of all the episodes and will then generate the script of the next episode, one that was never produced in the series. For generations, crafting storylines, compelling dialogue, and entire scripts has been the domain of humans. However, this process is often time-consuming and relies heavily on the collaboration of several people, especially when developing scripts for long-running series such as ‘Brooklyn Nine-Nine.’ Hence, in this blog, we will build a script generator using Generative AI, which can help screenwriters write scripts more quickly.

Now, let’s understand the technology we will use in this blog. Generative AI is a subset of artificial intelligence capable of producing new images, audio, video, text, etc. It is now used in almost every field to reduce the time required to finish a specific task. When it comes to textual data, Generative AI can generate human-like text. It can understand the context of a task and generate text based on it. In the context of web series script generation, we can use Generative AI to generate a new episode script with the same writing style and tone as the whole web series.
Learning Objectives
- We will understand how AI can be used in content writing for web series, movies, etc.
- We will learn the detailed process of building a script generator model, including data scraping, cleaning, model building, etc.
- We will learn the strengths of generative AI in script writing, how efficiently it can write, and its advantages.
- We will learn how important preparing and cleaning data is and how this affects the script generator.
This article was published as a part of the Data Science Blogathon.
The Process of Creating Scripts Using Generative AI
First, let’s briefly overview the complete flow for building the script generator:
Gathering Web Series Data
As we all know, we must gather data before building any model. So, to build the AI script generator, we first need to collect the scripts of the web series. This means collecting the scripts of many individual episodes. We can obtain these by scraping websites, through databases, or by seeking permission from the owners of the scripts. The main goal is to build a large dataset with a wide range of dialogues, exchanges between the characters, scene developments, and plot twists from the series. As we build the dataset, we must make sure that the data we collect is accurate, complete, and free of copyright issues.
Cleaning and Pre-processing the Data
Data pre-processing is a crucial step that ensures our data is clean and tidy. This step involves removing unnecessary data, such as stage directions or director’s descriptions. Since we are collecting data through web scraping, we need to check for any missing data. We may also need to normalize the text by removing punctuation and special characters and converting all the words to lowercase. In this way, we clean our dataset.
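The cleaning steps described above can be sketched in a few lines. This is only a minimal illustration, not the code used later in this blog: the function name `normalize_script` and the sample line are hypothetical, and the sketch assumes that stage directions appear inside square brackets.

```python
import re

def normalize_script(text):
    """Illustrative cleaner: drop bracketed stage directions, lowercase, strip punctuation."""
    # Remove stage directions such as "[Jake walks in]" (assumed bracket convention)
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Convert everything to lowercase
    text = text.lower()
    # Replace anything that is not a letter, digit, apostrophe, or whitespace with a space
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "[Jake walks in] JAKE: Good morning, everyone!"
cleaned = normalize_script(sample)
print(cleaned)
```

How aggressively to strip punctuation is a design choice: removing it shrinks the vocabulary, but a model trained on such text cannot learn where sentences end.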
Data Preparation
After thoroughly cleaning the dataset, it is time to prepare it according to our model’s needs. First, we tokenize the script into individual words using a Tokenizer. The tokenizer breaks each sentence or line of dialogue into individual words and assigns every distinct word a unique index value, forming a word index. Next, we create sequences of tokens, that is, a list of tokens for each dialogue in the script. After tokenization, we pad these sequences with zeros at the beginning so that every input to the model has the same length. Then, the last word of each sequence is used as the label, which is the next word the model must predict. Finally, the labels are converted to categorical format using one-hot encoding. In this way, the dataset is prepared for model training.
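To make these steps concrete, here is a small pure-Python sketch of the same pipeline on two toy lines. The helper names are hypothetical, and the word index here is assigned in order of first appearance, whereas the Keras Tokenizer used later orders words by frequency, so the actual index values will differ.

```python
def build_word_index(lines):
    """Assign each distinct word an index starting at 1 (0 is reserved for padding)."""
    index = {}
    for line in lines:
        for word in line.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def make_training_pairs(lines):
    word_index = build_word_index(lines)
    # Every prefix of length >= 2 of each line becomes one training sequence
    sequences = []
    for line in lines:
        ids = [word_index[w] for w in line.split()]
        for end in range(2, len(ids) + 1):
            sequences.append(ids[:end])
    # Pre-pad with zeros so all sequences share the length of the longest one
    max_len = max(len(s) for s in sequences)
    padded = [[0] * (max_len - len(s)) + s for s in sequences]
    # The last token is the label; everything before it is the predictor
    predictors = [s[:-1] for s in padded]
    labels = [s[-1] for s in padded]
    return word_index, predictors, labels

word_index, predictors, labels = make_training_pairs(["cool cool cool", "no doubt no doubt"])

# One-hot encode the labels: a vector of zeros with a single 1 at the label's index
vocab_size = len(word_index) + 1
one_hot = [[1 if i == label else 0 for i in range(vocab_size)] for label in labels]

print(word_index)
print(predictors)
print(labels)
```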
Building Generative Model
Once the data is prepared, we are ready to build our generative model. For the text generation task, we need a model that can handle sequential data. This blog will use a transformer-based model to generate the scripts. During training, the model learns to predict the next word based on the previous words. We can then assess the quality of the model’s predictions using a loss function, such as cross-entropy loss.
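As a rough illustration of what cross-entropy loss measures for next-word prediction, the sketch below computes it for a single prediction. The probabilities and the four-word vocabulary are made up for the example; in practice, the training library computes this loss over every token in a batch.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy for one prediction: negative log of the probability given to the true word."""
    return -math.log(predicted_probs[target_index])

# Hypothetical probabilities over a 4-word vocabulary; the true next word has index 2
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = cross_entropy(confident, 2)
loss_unsure = cross_entropy(unsure, 2)
print(loss_confident, loss_unsure)
```

The more probability the model puts on the correct next word, the lower the loss, so a falling loss during training indicates that the predictions are improving.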
Generating New Script
Once our model is trained, we can generate a new episode script. To do this, we first feed the model an initial sentence called the ‘seed.’ The model then predicts the next word from this seed sentence, using the probabilities it learned during training. The predicted word is appended to the sequence, and the process is repeated until the script reaches the desired length.
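The loop described above can be sketched with a toy stand-in for the trained model. The probability table below is entirely hypothetical and looks only at the last word, whereas a real transformer conditions on the whole preceding context; the point is only to show the seed, predict, append, and repeat cycle.

```python
import random

# Hypothetical next-word probabilities standing in for a trained model
NEXT_WORD_PROBS = {
    "jake": {"says": 0.6, "smiles": 0.4},
    "says": {"cool": 0.7, "noice": 0.3},
    "smiles": {"and": 1.0},
    "and": {"says": 1.0},
    "cool": {"cool": 0.5, "jake": 0.5},
    "noice": {"jake": 1.0},
}

def generate(seed, length, rng):
    """Extend the seed one word at a time until it contains `length` words."""
    words = seed.split()
    while len(words) < length:
        probs = NEXT_WORD_PROBS[words[-1]]
        candidates = list(probs)
        weights = [probs[w] for w in candidates]
        # Sample the next word according to the model's probabilities
        next_word = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(next_word)
    return " ".join(words)

script = generate("jake", 10, random.Random(0))
print(script)
```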
Benefits of Using Generative AI in Scriptwriting
Here are the benefits of using Generative AI in scriptwriting:
- As we discussed earlier, scriptwriting is a time-consuming process, as it is done manually by human writers. With Generative AI, we can speed up the process by generating initial drafts.
- One of the main benefits of using AI to generate a script is that it can maintain the writing style and tone of the previous scripts in the new script.
- Generative AI can generate creative and interesting dialogue that might not occur to human writers.
- It helps scriptwriters spend their time refining and perfecting the script rather than writing it from scratch.
Challenges of Using Generative AI in Scriptwriting
Here are the challenges:
- The first challenge that one might face while building this AI script generator is data collection. We should check for any copyright issues.
- Generative AI can struggle to understand the script’s context, which can lead to inconsistencies in the storyline.
- Although Generative AI can generate scripts very quickly, it can lack the level of creativity and originality of a human writer.
- One of the main challenges is that Generative AI requires a lot of computational power, which can be expensive.
Now, let’s dive into the code behind this AI script generator in the next few sections.
You can execute all the code by clicking on the ‘Copy & Edit’ button on this link.
First, let’s import all the libraries we will use to build the script generator.
import requests
from bs4 import BeautifulSoup
import re
from nltk.tokenize import sent_tokenize
import plotly.express as px
from collections import Counter
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import (GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
                          DataCollatorForLanguageModeling)
from transformers import Trainer, TrainingArguments
About Dataset
We will use the scripts of the Brooklyn 99 web series. Since it has many episodes, it gives our model plenty of text to learn from. We will use the BeautifulSoup and Requests libraries to scrape these scripts from a web page.
We will do this using two functions, namely ‘fetch_and_preprocess_scripts’ and ‘preprocess_text’.
The first function takes a URL as a parameter. This URL is the web page from which we will scrape our scripts. We use the requests library to send an HTTP request and get the HTML content of the page, and then use the BeautifulSoup library to parse this HTML content. We look for all anchor tags (<a>) with the class ‘topictitle’, as they contain the links to the individual episode scripts. Then, we build the full URL of each script by appending the link to the base URL and store the results in a list. The first two links are skipped, and the list is reversed to maintain the order of the episodes. Finally, the function iterates over each script link and extracts the text, which is appended to a final script string.
# Function to fetch and preprocess the script content from a given URL
def fetch_and_preprocess_scripts(url):
    base_url = "https://transcripts.foreverdreaming.org"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Episode links are the anchor tags with class "topictitle"
    anchor_tags = soup.find_all("a", class_="topictitle")
    # Drop the first character of each href before appending it to the base URL
    links = [base_url + tag["href"][1:] for tag in anchor_tags]
    # Skip the first two links, then reverse to restore episode order
    links = links[2:]
    links.reverse()
    final_script = ""
    for link in links:
        response = requests.get(link)
        soup = BeautifulSoup(response.content, "html.parser")
        script_div = soup.find("div", class_="content")
        script_text = script_div.get_text(separator="\n") if script_div else ""
        final_script += script_text.strip() + "\n"
    preprocessed_script = preprocess_text(final_script)
    return preprocessed_script
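The function above builds each episode URL by slicing off the first character of the href and concatenating the rest to the base URL, which only works when every href has the same leading character. As a hedged alternative, the standard library’s `urljoin` resolves relative links for us. The sketch below assumes the hrefs look like `./viewtopic.php?...`; the helper name `build_links` and the topic ids are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://transcripts.foreverdreaming.org/"

def build_links(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs against the base URL."""
    return [urljoin(base_url, href) for href in hrefs]

# Hypothetical href values in the assumed "./viewtopic.php?..." form
example_hrefs = ["./viewtopic.php?t=1", "./viewtopic.php?t=2"]
links = build_links(example_hrefs)
print(links)
```

With this approach, an href that is already absolute would also pass through unchanged instead of being mangled by the slice.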
Next, we define the preprocess_text function, which cleans the script string by removing HTML tags and any text enclosed in square brackets, tokenizing the text into sentences, and converting the sentences to lowercase.
# Function to clean and preprocess the text
def preprocess_text(text):
    # Remove HTML tags
    cleaned_text = re.sub(r'<[^>]+>', '', text)
    # Remove text enclosed in square brackets
    cleaned_text = re.sub(r'\[[^\]]+\]', '', cleaned_text)
    # Split into sentences (requires NLTK's sentence tokenizer data)
    sentences = sent_tokenize(cleaned_text)
    preprocessed_text = " ".join(sentence.lower() for sentence in sentences)
    return preprocessed_text

url = "https://transcripts.foreverdreaming.org/viewforum.php?f=429&sid=acbdaf84cb954f2929838f627cb124cb&start=78"
newpreprocessed_script = fetch_and_preprocess_scripts(url)
url1 = "https://transcripts.foreverdreaming.org/viewforum.php?f=429"
new_preprocessed_script = fetch_and_preprocess_scripts(url1)
preprocessed_script = newpreprocessed_script + new_preprocessed_script
In this way, we scraped and cleaned the episodes’ scripts. Now, our dataset is ready.
Now let’s look at the first 500 characters of our dataset, which should be the opening of the series’ pilot episode.
print(preprocessed_script[:500])
Output:

Exploratory Data Analysis
In this section, we will perform Exploratory Data Analysis (EDA) on our script data. We will begin by splitting the preprocessed script into individual tokens (words). Then, we will count the frequency of each token using the Counter() class.
tokens = preprocessed_script.split()
token_counter = Counter(tokens)
Now, we will extract the 20 most common tokens and their counts.
most_common_tokens = token_counter.most_common(20)
token_labels, token_counts = zip(*most_common_tokens)
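In practice, the most frequent tokens in dialogue tend to be very common words such as “i” or “the”, which say little about the series itself. One optional refinement, sketched below, is to drop such words before counting. The stop-word set and the sample sentence are tiny hand-picked assumptions; a real run would use a fuller stop-word list.

```python
from collections import Counter

# Tiny hypothetical stop-word set, for illustration only
STOP_WORDS = {"i", "the", "is"}

sample_tokens = "i think the heist is the best heist i know".split()

# Keep only the tokens that are not stop words, then count them
content_tokens = [t for t in sample_tokens if t not in STOP_WORDS]
content_counter = Counter(content_tokens)

top = content_counter.most_common(1)
print(top)
```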
Finally, we will create a DataFrame from this data and visualize it using a bar chart. The x-axis shows the words, and the y-axis shows the count of each word.
data = {'Word': token_labels, 'Frequency': token_counts}
df = pd.DataFrame(data)
fig = px.bar(df, x='Word', y='Frequency', title="Most Common Words")
fig.update_xaxes(tickangle=45)
fig.show()
Output:

Another way to visualize the most used words is to make a word cloud, a more visually appealing chart. We need to create a WordCloud object and pass a few details about the chart, like width, height, and background_color, and then generate the cloud from the whole script.
text = " ".join(tokens)
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Script Words')
plt.show()
Output:

Data Preparation
Now, we will start preparing our dataset so that it is suitable for training the model. This involves tokenizing the preprocessed script into individual words using the Tokenizer() class. First, we create a Tokenizer object, and then we call its fit_on_texts() function to build the vocabulary. Finally, we get the total number of unique words, which is used later.
# Tokenizing the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([preprocessed_script])
total_words = len(tokenizer.word_index) + 1
Now, we will create sequences of these tokens for every line in the script. We do this by iterating over every line of the script, converting each line to a sequence of tokens, and then creating n-gram sequences from these tokens. Finally, we append these sequences to a list of input sequences.
# Creating sequences of tokens
input_sequences = []
for line in preprocessed_script.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
The next step in the data preparation phase is to pad the input sequences. We do this to ensure that all input sequences have the same length. To do this, we call the ‘pad_sequences’ function from the Keras library, passing the input_sequences variable and the length of the longest sequence.
# Padding sequences to ensure uniform length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
Finally, we split each sequence into predictors and a label. The predictors contain all the words of the sequence except the last word, and the label contains the last word of the sequence. We do this to train the model to predict the next word (the label) from the preceding words (the predictors). The labels are then converted to a categorical format, which is necessary for training the model.
# Creating predictors and labels
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
# Converting labels to categorical format
from tensorflow.keras.utils import to_categorical
label = to_categorical(label, num_classes=total_words)
print("Total words:", total_words)
print("Max sequence length:", max_sequence_len)
print("Number of input sequences:", len(input_sequences))
Output:

Model Building
Now, for the main part of building this AI script generator, we will use the pre-trained GPT-2 model from the transformers library.
First, we will load the tokenizer and model using the from_pretrained() function.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
The next step is tokenizing and encoding the script data. This is done using the tokenizer.
preprocessed_script_tokens = tokenizer(preprocessed_script, return_tensors="pt", max_length=1024,
truncation=True)
Now, we will save the preprocessed script to a text file.
file_path = "preprocessed_script.txt"
with open(file_path, "w") as f:
    f.write(preprocessed_script)
Now, we will build a PyTorch dataset from this file using the TextDataset class from the transformers library, which tokenizes the text and splits it into blocks of block_size tokens. The data collator is created with mlm=False because GPT-2 is trained with causal, not masked, language modeling.
dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
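Conceptually, a block-based dataset such as this one cuts the long stream of token ids into fixed-size training examples. The sketch below is a simplified illustration of that idea and not the library’s actual code; the function name is hypothetical, a small block size is used for readability, and it assumes that a short trailing remainder is simply dropped.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of exactly block_size ids."""
    blocks = []
    # Step through the ids one full block at a time; a short trailing remainder is dropped
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

token_ids = list(range(10))  # stand-in for real token ids
blocks = chunk_into_blocks(token_ids, block_size=4)
print(blocks)
```

With block_size=128 as in this blog, each training example would be a run of 128 consecutive tokens from the script, so an example can span several lines of dialogue.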
Now, we will define the training arguments using the TrainingArguments class. The main arguments are the number of training epochs and the batch size.
training_args = TrainingArguments(
    output_dir="./script_generator",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    report_to=[],  # Disabled wandb logging
)
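To get a feel for what these settings imply, the following back-of-the-envelope sketch estimates the number of optimization steps and how often a checkpoint would be written. The dataset size is a made-up number, and the calculation assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

num_examples = 4_000  # hypothetical number of 128-token blocks in the dataset
steps = total_training_steps(num_examples, batch_size=4, epochs=50)

# With save_steps=10_000, a checkpoint is written every 10,000 steps
checkpoints_written = steps // 10_000
print(steps, checkpoints_written)
```

Under these assumptions, several checkpoints would be written over the run, and save_total_limit=2 would keep only the two most recent ones on disk.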
Now, we will create the Trainer object and pass it the model, training_args, data_collator, and dataset variables we have created so far.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
Finally, we will train the model using the train() function.
trainer.train()
This fine-tunes the model to generate scripts in the style of the preprocessed script data.
Generating Scripts
Now that the model is trained, we can generate a new episode’s script by loading the trained model and tokenizer. The path below assumes that the fine-tuned model was saved to that directory beforehand, for example with trainer.save_model(); the training arguments above write checkpoints to ./script_generator.
# Loading the fine-tuned model
model = GPT2LMHeadModel.from_pretrained("/kaggle/working/fine_tuned_script_generator")
# Loading the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Now, we will define a seed sentence, which will serve as a starting point for the new script.
# Generating text
prompt_text = "Detective Jake Peralta enters the precinct and announces:"  # Custom prompt text
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")
Now, we will generate the script using the generate() function, to which we pass the seed sentence and the maximum length of the new script. Another parameter we pass is temperature, which controls the randomness of the predictions: lower values make the output more predictable, and higher values make it more varied.
# Generating text with a maximum length of 500 tokens
output1 = model.generate(input_ids, max_length=500, num_return_sequences=1, temperature=0.7,
                         do_sample=True)
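To illustrate what temperature does, the sketch below applies a temperature-scaled softmax to a few made-up scores. It is a simplified illustration of the general technique rather than the library’s internal code, and the scores and function name are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate next words
sharp = softmax_with_temperature(logits, 0.7)
flat = softmax_with_temperature(logits, 1.5)
print(sharp)
print(flat)
```

A temperature below 1, such as the 0.7 used here, concentrates probability on the highest-scoring words, while a temperature above 1 spreads it more evenly and makes the output more varied.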
Finally, we will decode the generated script and format it into a more readable form.
# Decoding the generated text
generated_text = tokenizer.decode(output1[0], skip_special_tokens=True)
delimiters = [". ", "? ", "! ", "| "]
for delimiter in delimiters:
    generated_text = generated_text.replace(delimiter, delimiter + "\n")
# Printing each dialogue on a new line
print(generated_text.strip())
Output:

As you can see, our model generated a new episode script that is quite accurate and interesting.
Conclusion
In conclusion, Generative AI is a powerful tool for script generation. It can create a new script that matches the tone of the particular web series on which the model is trained, and it can reduce the time and effort of human writers. The quality of the generated scripts depends on the quality and quantity of the dataset, as well as on the choice of the model and its parameters. Despite these challenges, scriptwriters can use a script generator to produce an initial draft of an episode and then refine it to their needs, reducing the total time spent on script writing.
Key Takeaways
- Generative AI can be a useful technology for scriptwriters, as they can build and use script generators with it.
- To build this script generator, we should first gather a web series dataset, prepare the dataset in a suitable format, build the model, and generate a new episode script.
- One of the main challenges while building this script generator is that it requires a lot of computational power.
- The quality of the newly generated script depends on the quantity and quality of the training data.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Frequently Asked Questions
A. Generative AI is a technology capable of creating new things, such as images, songs, videos, text, etc.
A. It can help scriptwriters by generating a new episode script as an initial draft. Scriptwriters can refine this draft to their needs and produce the final draft, reducing the total time needed to write the script.
A. The need for computational resources is one of the main challenges in building the script generator.
A. No, Generative AI cannot completely replace scriptwriters yet. But it can help scriptwriters write a new script in a short time.


