KR 2026 homework part 2

Source: Lambda

This lab is primarily an exploration in prompt engineering a VLM to create structured representations.

Task

You will have to write a small system where a VLM (vision language model; most large commercial LLMs are actually VLMs) describes the contents of an ordinary room, using the same schema and principles you used yourself in the first lab. The main part of the work is prompt engineering the VLM to do a reasonably good job.

Your system should be able to automatically generate, by looking at the same image, a significant portion of the data you yourself wrote in the first lab. It should also be able to generate a noticeable amount of sensible structured data for other, similar images of rooms.

Crucially, your system should use the same principles / schema of knowledge representation you used yourself in the first lab. The logical reasoning you could do in the first lab should also be doable with the output of your system. The concrete choice of a representation language (JSON-LD or not) is up to you. There is no need to generate the knowledge in several representation languages.

You have to present at least three images of a room (including the one used in the first lab) along with the output of your system. For the image from the first lab, compare the data written by you to the data created by your system.

Initial ideas and suggestions

The system could either:

  • Ask the VLM to produce the structured information using your schema/encoding directly, or
  • First ask the VLM to produce an intermediate representation and then convert it to a suitable final representation, either programmatically or using an LLM. Here there are again several options:
    • Let the VLM produce an English description and then use an LLM to convert it to the final representation.
    • Let the VLM produce some structured JSON not directly in the final format, and then convert it to the final representation programmatically (or with an LLM).

Explore which way works best for you. It is probably easiest to start with the first, direct approach.
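As an illustration of the second approach, here is a minimal sketch of a programmatic converter. The intermediate JSON shape and the `json_to_facts` helper are hypothetical examples, not part of the assignment; your own schema will likely differ:

```python
import json

def json_to_facts(raw: str) -> str:
    """Convert a hypothetical intermediate JSON format, e.g.
    {"relations": [["on", "desk", "glass"]], "types": {"plant1": "plant"}},
    into Prolog-style facts like on(desk, glass)."""
    data = json.loads(raw)
    facts = []
    # Each relation is [predicate, arg1, arg2, ...]
    for pred, *args in data.get("relations", []):
        facts.append(f"{pred}({', '.join(args)}).")
    # Type assertions become unary predicates: plant(plant1).
    for entity, typ in data.get("types", {}).items():
        facts.append(f"{typ}({entity}).")
    return "\n".join(facts)

example = '{"relations": [["on", "desk", "glass"], ["nextTo", "desk", "window"]], "types": {"plant1": "plant"}}'
print(json_to_facts(example))
```

The advantage of this split is that the VLM only has to produce well-formed JSON (which models are good at), while the exact fact syntax is enforced deterministically in code.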

Some tips about prompt engineering

Create a prompt which is a balanced mixture of instructions and examples.

The instructions should describe the task and the expected encoding. Give a nomenclature of predicates and explain what each one should represent. Give several examples of room descriptions in the expected encoding. You can also give negative examples: output which is not OK.

Importantly, give several examples of actual images along with the expected description in the expected encoding.

NB! Do not reuse the images / descriptions from your prompt as any of your three demonstration image/description cases.
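One way to organize such a prompt in code is to keep the instructions, positive examples and negative examples as separate parts and concatenate them. All the strings below are placeholders illustrating the structure; you would fill them with your own schema and examples:

```python
SCHEMA_INSTRUCTIONS = """You describe rooms as logic facts.
Allowed predicates: on/2, in/2, nextTo/2, attachedTo/2, color/2, name/2.
Output only facts like on(desk, glass). One fact per line, nothing else."""

POSITIVE_EXAMPLE = """Example of a correct description:
on(desk, glass).
nextTo(desk, window).
color(curtains, blue)."""

NEGATIVE_EXAMPLE = """Example of INCORRECT output (do not do this):
* `on(desk, glass)`  -- no bullets, no backticks
This is a lovely room!  -- no free-form commentary"""

def build_prompt() -> str:
    # Join the parts with blank lines into one prompt string.
    return "\n\n".join([SCHEMA_INSTRUCTIONS, POSITIVE_EXAMPLE, NEGATIVE_EXAMPLE])

print(build_prompt())
```

Keeping the parts separate makes it easy to experiment: you can add or drop examples and measure how the output quality changes.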

Some tips about writing code

First, it is highly recommended to use a VLM API for the task, not a web-based chat window (you cannot easily build a system around a chat window). Look up the APIs. They may require a credit card, although free tiers are available. The actual cost is going to be microscopic, since you are not going to perform massive amounts of large API calls.

Look up the API pages of large LLM providers. Most probably you can use the free tier of Gemini: https://ai.google.dev/gemini-api/docs/pricing

When writing code, prefer doing it in a dialog with an LLM. Since the task is small, you can probably just use a chat window to describe the task, bring examples, and get and debug the code. It is surely a bit better to use a proper coding tool like Claude Code or a VS Code plugin for the task, but a plain chat window will be perfectly OK as well.

A tiny example

This example is without an API and without image / expected output examples, just using the Gemini web interface:

I attach this image to the question: https://turing.cs.ttu.ee/~tanel.tammet/room.jpg and ask:

Your job is to create a structured, logic-based description of the objects in a room 
and their relations on the attached image. 
Here is an example description for some other room: 

nextTo(sofa, lamp).
nextTo(cabinet, desk).
nextTo(cabinet, plant1).
attachedTo(cabinet, hanger).
on(desk, plant2).
on(desk, glass).
on(desk, papers).
on(papers, phone).
in(glass, water).
nextTo(desk, window).
nextTo(desk, curtains).
nextTo(window, curtains).
name(plant1, "Ficus benjamina").
name(plant2, "Spathiphyllum").
weight(cabinet, 90).

color(sofa, 'olive green').
color(curtains, blue).
color(plant1, green).
color(plant2, green).

plant(plant1).
plant(plant2).

Gemini 3 Flash then answers me:

This is a beautifully designed, Japandi-style living room. It balances warm wood textures with minimalist, neutral tones.

Using the structured, logic-based format you requested, here is a description of the objects and their spatial relationships:

### Object Relations

* `on(floor, rug)`
* `on(rug, coffeeTable)`
* `on(rug, sofa)`
* `on(rug, chair1)`
* `on(rug, chair2)`
* `on(sofa, pillow)`
* `on(sofa, blanket)`
* `on(coffeeTable, glassVase)`
* `on(coffeeTable, decorativeBowl)`
* `in(glassVase, sprig)`
* `on(shelf, books)`
* `on(shelf, ceramicVases)`
* `nextTo(sofa, fireplace)`
* `nextTo(fireplace, ceramicVases)`
* `nextTo(chair2, tree)`
* `attachedTo(wall, painting)`
* `behind(tree, window)`
* `inside(fireplace, fire)`

### Attributes and Metadata

* `color(sofa, 'beige/oatmeal')`
* `color(chair1, 'cognac leather')`
* `color(chair2, 'cognac leather')`
* `color(coffeeTable, 'natural oak')`
* `color(wall, 'off-white')`
* `color(rug, 'light grey')`
* `material(wallPanels, wood)`
* `material(floor, wood)`
* `style(room, 'Japandi')`
* `name(tree, "Juniper Bonsai-style")`

Would you like me to generate a similar logic-based description for a different room, or perhaps suggest some furniture pieces that would complement this specific aesthetic?

Now, the system you have to build should take an image (or image URL) as an argument, form a suitable API call and give back directly usable logic-structured data, without any "This is a beautifully designed, ..." or "Would you like me ..." or *-starting bullets or backticks etc. You can achieve this in combination with suitable instructions in the prompt and simple programmatic cleaning of the resulting text.
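The programmatic cleaning can be as simple as keeping only the lines that look like facts. A possible sketch, assuming Prolog-style facts ending with a period; the exact regular expression depends on your own schema:

```python
import re

# A line counts as a fact if it looks like predicate(arg, ...). ending with
# a period, after any markdown bullets and backticks have been stripped.
FACT_RE = re.compile(r"^[a-zA-Z_]\w*\(.*\)\.$")

def clean_vlm_output(text: str) -> str:
    facts = []
    for line in text.splitlines():
        # Remove leading bullet markers and surrounding backticks.
        line = line.strip().lstrip("*-• ").strip("`").strip()
        if FACT_RE.match(line):
            facts.append(line)
    return "\n".join(facts)

raw = """This is a beautifully designed room.
* `on(rug, sofa).`
* `nextTo(sofa, fireplace).`
Would you like me to continue?"""
print(clean_vlm_output(raw))
```

This drops the conversational filler while keeping every well-formed fact, so the result can be fed directly to a reasoner.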

API calling example

Here is an example of calling the Gemini API directly with text (not images). NB! This code does not run "as is": it assumes helper functions like llm_error(...), utils.debug_print(...), _gemini_supports_thinking(version) and configuration variables like gemini_secrets_file, temperature, timeout, max_retries and sleepseconds are available. It is recommended to use an LLM to generate your code. You can, for example, give an LLM this code and ask it to convert it into directly runnable code where all the necessary utilities are replaced or implemented. Also, ask it to modify the code for sending images, either by upload or by pointing to a URL.


import http.client
import json
import time

def call_gemini(version, sentences, sysprompt, max_tokens):
  try:
    sf = open(gemini_secrets_file, "r")
    key = sf.read().strip()
    sf.close()
  except:
    return llm_error("Could not read Gemini API key file: " + str(gemini_secrets_file))

  baseurl = "/v1beta/models/" + version + ":generateContent"
  genconfig = {
    "maxOutputTokens": max_tokens,
    "temperature": temperature
  }
  if _gemini_supports_thinking(version):
    genconfig["thinkingConfig"] = {"thinkingLevel": "medium"}
  call = {
    "contents": [{"parts": [{"text": sentences}]}],
    "generationConfig": genconfig
  }
  if sysprompt:
    call["system_instruction"] = {"parts": [{"text": sysprompt}]}

  utils.debug_print("gemini call", call, flag=calldebug)
  calltxt = json.dumps(call)

  trycount = 0
  while True:
    host = "generativelanguage.googleapis.com"
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
      conn.request("POST", baseurl, calltxt,
                   headers={"Content-Type": "application/json", "x-goog-api-key": key})
      response = conn.getresponse()
    except KeyboardInterrupt:
      raise
    except:
      trycount += 1
      if conn: conn.close()
      if trycount > max_retries:
        return llm_error("Gemini connection failed after " + str(max_retries) + " retries")
      print("Gemini connection failure, retrying...")
      time.sleep(sleepseconds * trycount)
      continue
    if response.status != 200 or response.reason != "OK":
      message = ""
      try:
        data = json.loads(response.read())
        if "error" in data and "message" in data["error"]:
          message = ": " + data["error"]["message"]
      except:
        pass
      trycount += 1
      if conn: conn.close()
      if trycount > max_retries:
        return llm_error("Gemini API error " + str(response.status) + " " + str(response.reason) + message)
      print("Gemini API failure, retrying:", str(response.status), str(response.reason) + message)
      time.sleep(sleepseconds * trycount)
    else:
      break

  rawdata = response.read()
  conn.close()
  try:
    data = json.loads(rawdata)
  except KeyboardInterrupt:
    raise
  except:
    return llm_error("Gemini response is not valid JSON: " + str(rawdata))

  if "candidates" not in data:
    return llm_error("Gemini response has no candidates: " + str(rawdata))
  data = data["candidates"][0]
  if "content" not in data:
    return llm_error("Gemini response has no content: " + str(data))
  data = data["content"]
  if "parts" not in data:
    return llm_error("Gemini response has no parts: " + str(data))

  utils.debug_print("gemini response:", data, flag=debug)
  res = ""
  for el in data["parts"]:
    if "text" in el:
      res += el["text"].strip()
  return res
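For sending images, the Gemini generateContent endpoint accepts an image as an inline base64-encoded part placed next to the text part. Below is a minimal sketch of building such a request body; it only constructs the payload (sending it works exactly as in the code above), and you should check the official API documentation for the exact field names and supported MIME types:

```python
import base64
import json

def build_image_call(image_path: str, prompt: str) -> str:
    """Build a generateContent request body containing an inline image.
    The image bytes are base64-encoded into an inlineData part, followed
    by the text prompt as a second part."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    call = {
        "contents": [{
            "parts": [
                {"inlineData": {"mimeType": "image/jpeg", "data": image_b64}},
                {"text": prompt},
            ]
        }]
    }
    return json.dumps(call)
```

The resulting string can be passed as the request body in place of calltxt in the call_gemini function above.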