HTML Editing Simplified with Kotlin Script And Jsoup
Using the Jsoup library to read HTML files, extract information from them, and modify them
Every now and then, I need to write a script to simplify manual tasks at work or in my personal projects. I used to write these in Python simply because it’s easy to get started. Python is not my preferred language, so I would sometimes run into issues, but answers were never hard to find because Python is so widely used. Recently, however, I came across Kotlin scripting and decided to write my first Kotlin script for a small HTML manipulation I wanted to apply to multiple HTML files.
HTML manipulation is easy to do in Kotlin scripts with the Jsoup library, and I find the official Jsoup documentation really helpful and thorough. In this post, we explore how to use Jsoup to read HTML, extract information from it, and modify it.
We will use this simple HTML page throughout the post:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1 id="title">We are in test HTML</h1>
<div>
<a href="https://www.google.com">Take me to Google</a>
<a href="https://www.spotify.com">Take me to Spotify</a>
<a href="https://www.youtube.com">Take me to Youtube</a>
</div>
</body>
</html>
A quick note on the Kotlin script
Make sure you have a Kotlin version installed that supports .main.kts scripts. Scripts can be run using kotlinc -script <script-name>.main.kts or just kotlin <script-name>.main.kts. As these examples show, the script file name must end with the suffix .main.kts
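As a rough illustration of what such a script file looks like, here is a minimal sketch. The file name hello.main.kts and the greeting are made up for this example; the only thing it relies on is that a Kotlin script receives its command-line arguments through the implicit args array.

```kotlin
// hello.main.kts (hypothetical file name, used only for illustration)
// Run with: kotlin hello.main.kts Jsoup

// In a Kotlin script, command-line arguments are available
// through the implicit `args` array.
val name: String = args.firstOrNull() ?: "world"

val greeting = "Hello, $name!"
println(greeting)
```

Falling back to a default when no argument is given keeps the script from crashing on an empty args array.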
Reading HTML file
In a Kotlin script you can pass arguments when running the script. In this case, we will pass the path to the directory that contains the HTML file as the first argument (wrapped in a File named directory, as shown in the full script at the end), and then read the HTML file like this:
val htmlFile = Jsoup.parse(File("${directory.absolutePath}/test.html"), "UTF-8")
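Reading args[0] directly fails with an exception when the script is run without arguments. One possible way to guard against that is sketched below; the helper name resolveHtmlFile and the usage message are assumptions made for this example, not part of the original script.

```kotlin
import java.io.File

// Hypothetical helper: returns test.html inside the directory named by the
// first argument, or null if the argument is missing or the file is absent.
fun resolveHtmlFile(arguments: Array<String>): File? {
    val directoryPath = arguments.firstOrNull() ?: return null
    val candidate = File(directoryPath, "test.html")
    return if (candidate.isFile) candidate else null
}

val input = resolveHtmlFile(args)
if (input == null) {
    println("Usage: kotlin <script-name>.main.kts <directory containing test.html>")
} else {
    println("Will read ${input.absolutePath}")
}
```

Returning null instead of exiting keeps the check easy to reuse; the caller decides whether to print a usage message or stop the script.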
You can find elements in HTML by ID, tag, class, or attribute. In this example, let’s find an element by its ID.
val titleH1 = htmlFile.getElementById("title")
This returns an Element object, or null if no element has that ID. Once you have the object, you can extract information from it.
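The other lookups mentioned above (tag, class, and attribute) follow the same pattern. The sketch below parses an inline HTML string so it can run on its own; the class name "external" is an assumption added for illustration and does not exist in the test page used in this post.

```kotlin
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import org.jsoup.Jsoup

// Inline HTML so the sketch is self-contained. The class "external"
// is made up for this example.
val sampleHtml = """
    <html><body>
      <h1 id="title">We are in test HTML</h1>
      <div>
        <a class="external" href="https://www.google.com">Take me to Google</a>
        <a href="https://www.spotify.com">Take me to Spotify</a>
      </div>
    </body></html>
""".trimIndent()

val doc = Jsoup.parse(sampleHtml)

val byTag = doc.getElementsByTag("a")               // every <a> element
val byClass = doc.getElementsByClass("external")    // elements with class="external"
val byAttribute = doc.getElementsByAttribute("id")  // elements that carry an id attribute

println("By tag: ${byTag.size}")
println("By class: ${byClass.size}")
println("By attribute: ${byAttribute.size}")
```

Unlike getElementById, these three methods return a collection of matches, which is empty rather than null when nothing matches.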
Extracting information
Jsoup provides a few options to extract different things from the Element object. The three we will use here are element.html(), element.text(), and element.outerHtml(). For the test HTML in this example, these are the outputs you will see for each:
titleH1?.html() and titleH1?.text() => We are in test HTML
titleH1?.outerHtml() => <h1 id="title">We are in test HTML</h1>
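As you can see, if you want just the content of the element, use html() or text(); if you need the entire HTML, including the element’s own tag, use outerHtml(). The two content methods only differ when the element has nested markup: html() keeps the inner tags, while text() strips them.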
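For the h1 in the test page, html() and text() happen to return the same string because the heading contains no nested tags. A small sketch with a made-up paragraph (the id "intro" and its content are assumptions for this example) shows where the three methods differ:

```kotlin
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import org.jsoup.Jsoup

// A paragraph with a nested <b> tag, parsed from an inline string.
val doc = Jsoup.parse("""<p id="intro">Hello <b>world</b></p>""")
val intro = doc.getElementById("intro")

println(intro?.html())       // Hello <b>world</b>
println(intro?.text())       // Hello world
println(intro?.outerHtml())  // <p id="intro">Hello <b>world</b></p>
```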
Jsoup also provides a way to select multiple elements using CSS-like selector syntax. The select() method returns the matches as an Elements object, which is a list of Element objects that you can iterate over to perform any action.
Let’s see an example that uses the selector syntax and then reads an attribute:
val links = htmlFile.select("div > a")
val googleLink = links[0].attr("href") // this will return https://www.google.com
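Because the result of select() behaves like a regular list, the usual Kotlin collection operations apply to it. The following sketch, which uses an inline copy of the test page's links so it runs standalone, loops over the matches and also collects all href values with map:

```kotlin
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import org.jsoup.Jsoup

// Inline copy of the links from the test page so the sketch runs on its own.
val doc = Jsoup.parse(
    """
    <div>
      <a href="https://www.google.com">Take me to Google</a>
      <a href="https://www.spotify.com">Take me to Spotify</a>
      <a href="https://www.youtube.com">Take me to Youtube</a>
    </div>
    """.trimIndent()
)

val links = doc.select("div > a")

// Iterate over every matched element.
for (link in links) {
    println("${link.text()} -> ${link.attr("href")}")
}

// Collect one attribute from all matches into a plain Kotlin list.
val hrefs: List<String> = links.map { it.attr("href") }
println(hrefs)
```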
Modify HTML
Finally, let’s look at how you can modify HTML. Jsoup provides easy ways to change an existing attribute, set the text of an existing element, and append or prepend HTML. We will use the links in our test HTML to demonstrate the first three; prepend works the same way as append, but inserts the new HTML at the start of the element instead of the end.
// Modify attribute
links[0].attr("href", "https://www.gmail.com")
// Set new text value
links[0].text("Take me to Gmail")
// Append to existing div
val firstDiv = htmlFile.select("div").first()
firstDiv?.append("<a href=\"https://www.reddit.com\">Take me to Reddit</a>")
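Two related operations are not shown above: prepending HTML, and changing an attribute on every matched element at once. The sketch below illustrates both on an inline copy of the div; the target="_blank" attribute is just an example value chosen for this illustration.

```kotlin
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import org.jsoup.Jsoup

val doc = Jsoup.parse(
    """
    <div>
      <a href="https://www.google.com">Take me to Google</a>
      <a href="https://www.spotify.com">Take me to Spotify</a>
    </div>
    """.trimIndent()
)

// Prepend: the new link becomes the first child of the div.
val firstDiv = doc.select("div").first()
firstDiv?.prepend("<a href=\"https://www.reddit.com\">Take me to Reddit</a>")

// Calling attr(key, value) on an Elements collection sets the
// attribute on every element in the collection.
val links = doc.select("div > a")
links.attr("target", "_blank")

println(firstDiv?.outerHtml())
```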
So far the changes exist only in memory. To save them, write the document back to your test file or to a new file by adding this at the end of your script:
File("${directory.absolutePath}/updated.html").writeText(htmlFile.html())
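The same read, modify, and write steps can be repeated for every HTML file in a directory. This is only a sketch under a few assumptions: the helper name retargetLinks, the choice of adding target="_blank" to links, and the separate output directory are all made up for illustration. The demo runs against temporary directories so that no real files are touched.

```kotlin
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import java.io.File
import java.nio.file.Files
import org.jsoup.Jsoup

// Hypothetical helper: adds target="_blank" to every link in every .html
// file of `source` and writes the results to `target`.
// Returns the number of files processed.
fun retargetLinks(source: File, target: File): Int {
    target.mkdirs()
    // listFiles() returns null if `source` is not a readable directory.
    val htmlFiles = source.listFiles()
        ?.filter { it.isFile && it.extension == "html" }
        ?: return 0
    for (file in htmlFiles) {
        val doc = Jsoup.parse(file, "UTF-8")
        doc.select("a[href]").attr("target", "_blank")
        File(target, file.name).writeText(doc.html())
    }
    return htmlFiles.size
}

// Demo on temporary directories.
val source = Files.createTempDirectory("jsoup-source").toFile()
val target = Files.createTempDirectory("jsoup-target").toFile()
File(source, "a.html").writeText("<a href=\"https://www.google.com\">Google</a>")
File(source, "notes.txt").writeText("not HTML")

val count = retargetLinks(source, target)
println("Updated $count file(s)")
```

Writing to a separate directory leaves the original files untouched, which makes it easier to compare before and after.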
Finally, here’s the full script we used in this post:
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import java.io.File
import kotlin.system.exitProcess
import org.jsoup.Jsoup

val directory = File(args[0])

fun run() {
    println("Start....")

    /**
     * Reading HTML file
     */
    val htmlFile = Jsoup.parse(File("${directory.absolutePath}/test.html"), "UTF-8")
    val titleH1 = htmlFile.getElementById("title")

    /**
     * Extracting information
     */
    println("titleH1?.html() => ${titleH1?.html()}, titleH1?.text() => ${titleH1?.text()}, titleH1?.outerHtml() => ${titleH1?.outerHtml()}")
    val links = htmlFile.select("div > a")
    val googleLink = links[0].attr("href")
    println("Google link $googleLink")

    /**
     * Modify HTML
     */
    // Modify attribute
    links[0].attr("href", "https://www.gmail.com")
    // Set new text value
    links[0].text("Take me to Gmail")
    // Append
    val firstDiv = htmlFile.select("div").first()
    firstDiv?.append("<a href=\"https://www.reddit.com\">Take me to Reddit</a>")

    File("${directory.absolutePath}/updated.html").writeText(htmlFile.html())
    println("End....")
    exitProcess(0)
}

// Usage kotlin <script-name>.main.kts <directory where html file is>
// Example - kotlin kotlin-script-example.main.kts ~/Documents
run()
Final Thoughts
I found Jsoup very easy to use, and it provides extensive ways of working with HTML files in a Kotlin script. Jsoup can also be used in other Kotlin projects, such as an Android app, by adding it as a Gradle dependency. Refer to the official Jsoup docs for more information.
If you have tried Jsoup, please share your experience, along with any comments or feedback you have on this post.