Connection-less Transport Management for Long Run Sessions

April 29, 2024

11 Minute Read

By Jérémy Audiger, IoTerop Principal Software Engineer

Have you ever wondered why the connection between your device and your server seems intermittently disrupted? This can be frustrating: lost connections can have multiple causes, and they reduce the overall efficiency of a solution by shortening battery life and increasing cloud connection costs.

However, this behavior is often perfectly normal, and it is due to the way Network Address Translation (NAT) works. NAT typically maintains an internal table (called a “NAT table”) to map your device’s internal IP to the external IP address of your router. The NAT table is used to route the response from one side to the other.

Like all systems in the world, NAT devices are built on physical hardware, and hardware has its limitations. NAT tables can’t be infinite, and can therefore struggle to maintain mappings over a long period of time. A timeout exists to clean up stored mappings, which helps the NAT manage its RAM/CPU consumption. According to [RFC 4787 (chapter 4.3. Mapping Refresh)], this timeout is typically around 2 minutes and can vary depending on the NAT implementation.

The RFC describes the mapping timer as follows:

“NAT mapping timeout implementations vary, but include the timer’s value and the way the mapping timer is refreshed to keep the mapping alive.

The mapping timer is defined as the time a mapping will stay active without packets traversing the NAT. There is great variation in the values used by different NATs.

REQ-5: A NAT UDP mapping timer MUST NOT expire in less than two minutes, unless REQ-5a applies.”

If your device or server fails to send any packet during the timeout period, its corresponding entry in the NAT table will be marked for cleaning. This is why the link can be broken over time or after a long period of inactivity.
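To make the expiry behavior concrete, here is a minimal sketch of a NAT table with an idle timer. It is a toy model, not how any particular router works: the `NatTable` class and its method names are hypothetical, the 120-second default mirrors the two-minute floor quoted above, and it assumes only outbound packets refresh the timer.

```python
from dataclasses import dataclass, field


@dataclass
class NatTable:
    """Toy NAT table: one mapping per internal (ip, port), with an idle timer."""

    idle_timeout_s: float = 120.0  # assumption: the two-minute floor from REQ-5
    _last_seen: dict = field(default_factory=dict)

    def outbound(self, internal: tuple, now: float) -> None:
        # An outbound packet creates the mapping or refreshes its timer.
        self._last_seen[internal] = now

    def is_mapped(self, internal: tuple, now: float) -> bool:
        # The mapping is usable only while the idle timer has not elapsed.
        last = self._last_seen.get(internal)
        return last is not None and (now - last) < self.idle_timeout_s

    def sweep(self, now: float) -> None:
        # Periodic cleanup: drop every mapping whose timer has expired.
        self._last_seen = {
            key: seen
            for key, seen in self._last_seen.items()
            if (now - seen) < self.idle_timeout_s
        }
```

A device that stays silent for the whole timeout loses its mapping, while a single packet sent in time keeps it alive for another full period.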

Solutions

Now that the problem has been identified, how can we solve it? Let’s go over a few options in detail in the following section.

Raw UDP

User Datagram Protocol (UDP) is a connection-less protocol, meaning that peers using this protocol do not maintain a continuous connection. In this sense, UDP can be considered a stateless protocol. On the other hand, Transmission Control Protocol (TCP) is a connection-oriented protocol. Peers using this protocol maintain a connection between each other through various messages.

By default, UDP does not provide a mechanism to maintain the connection. However, several simple to complex mechanisms do exist. If we take raw UDP (meaning sending packets without doing anything on top of it), the only way to maintain the connection is to send packets periodically in the time-frame of the NAT table timeout (more or less, 2 minutes). This mechanism is called “keep-alive,” and can be seen as a ping-pong between the different protagonists.
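A keep-alive loop over raw UDP can be sketched as below. This is only an illustration: the function name, the one-byte payload, and the 110-second default (chosen to stay slightly under a two-minute NAT timeout) are assumptions, and the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time


def send_keepalives(sock, server, count, interval_s=110.0,
                    payload=b"\x00", sleep=time.sleep):
    """Send `count` keep-alive datagrams to `server`, one every `interval_s`.

    `sock` is any object exposing sendto(data, address), such as a UDP socket.
    """
    for i in range(count):
        sock.sendto(payload, server)
        # No need to wait after the last datagram.
        if i < count - 1:
            sleep(interval_s)
```

With a real network, `sock` would be created with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and `server` would be the server's `(host, port)` pair.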

The drawback of this method is the consumption of the bandwidth. If you send a packet every 2 minutes, it means that you will consume 30 packets per hour, 720 packets per day, 21,600 packets per month, and so on. The number of packets sent can grow very quickly and can be a problem for an application running on constrained networks.
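The packet counts above follow from simple arithmetic, reproduced here as a short sketch. The helper name is hypothetical, and the monthly figure assumes a 30-day month.

```python
def keepalive_packets(interval_s: int, period_s: int) -> int:
    """Number of keep-alive packets sent during `period_s` at one per `interval_s`."""
    return period_s // interval_s


HOUR = 3600
DAY = 24 * HOUR
MONTH = 30 * DAY  # assumption: a 30-day month

per_hour = keepalive_packets(120, HOUR)
per_day = keepalive_packets(120, DAY)
per_month = keepalive_packets(120, MONTH)
print(per_hour, per_day, per_month)  # 30 720 21600
```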

Let’s take a real-world example from the Internet of Things: IoT devices are constrained, and rely on less-than-perfect networks. The latencies can be high while the throughput is often low. Due to the nature of UDP, the packets could be lost, or simply take time to be delivered to the other side. If you send a packet every 2 minutes, and one of the packets is dropped by the network, the connection could end up being closed by the NAT table. This happens frequently on constrained networks such as LoRaWAN, Sigfox, NB-IoT, and so on.
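One way to reason about lost keep-alives is to ask how short the interval must be for the mapping to survive a given number of consecutive losses. The sketch below gives an upper bound under simplifying assumptions: the NAT timer is refreshed only by packets that actually cross it, network delay is ignored, and the function name is hypothetical.

```python
def max_keepalive_interval(nat_timeout_s: float, tolerated_losses: int) -> float:
    """Upper bound on the keep-alive interval that survives consecutive losses.

    If `tolerated_losses` keep-alives in a row are dropped before reaching the
    NAT, the next one must still cross it before the idle timer fires, so the
    interval has to stay strictly below timeout / (tolerated_losses + 1).
    """
    if tolerated_losses < 0:
        raise ValueError("tolerated_losses must be >= 0")
    return nat_timeout_s / (tolerated_losses + 1)
```

Tolerating a single lost packet with a 2-minute timeout already means sending more often than once a minute, which doubles the traffic discussed above.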

Now, let’s see how this problem is handled by a well-known IoT protocol: the Lightweight Machine to Machine (LwM2M) protocol.

LwM2M provides a registration mechanism to maintain a shared context on a LwM2M Client and a LwM2M Server. The registration is based on the concept of lifetime, expressed in seconds. The lifetime is a value sent by the LwM2M Client to the LwM2M Server, and represents the time that the registration context is valid. To maintain this context, the LwM2M Client needs to send a registration update message to the LwM2M Server before the end of the lifetime. If it does not, the LwM2M Server will consider the LwM2M Client as deregistered. Each time the LwM2M Server receives a registration update message, it resets the lifetime timer.
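The lifetime rule can be sketched from the server's point of view as follows. The `Registration` class and its fields are hypothetical names used for illustration, not the API of any LwM2M library.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    """Server-side view of one LwM2M Client registration (illustrative only)."""

    endpoint: str
    lifetime_s: int     # the "lt" value sent by the client
    last_update: float  # time of the registration or of the last update

    def update(self, now: float) -> None:
        # A registration update resets the lifetime timer.
        self.last_update = now

    def is_valid(self, now: float) -> bool:
        # Once the lifetime elapses without an update, the client is
        # considered deregistered.
        return (now - self.last_update) < self.lifetime_s
```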

To prevent the problem seen with raw UDP, the lifetime is usually set to a value that is lower than the NAT table timeout, so that registration updates double as keep-alive packets. However, on constrained networks this is not acceptable for the bandwidth reasons explained above.

Let’s take an example with real messages exchanged between a LwM2M Client and a LwM2M Server:

1. The LwM2M Client sends a registration message to the LwM2M Server:

POST /rd?ep=urn:imei:1234567890&lt=120&lwm2m=1.1&b=U
Object lists: </1/0>;ver=1.1,</3/0>;ver=1.1,</5/0>,</3303/0>

2. The LwM2M Server sends an acknowledgement message to the LwM2M Client:

2.01 Created
Location-Path: </rd/1234>

3. The LwM2M Client sends a registration update message to the LwM2M Server:

PUT /rd/1234

4. The LwM2M Server sends an acknowledgement message to the LwM2M Client:

2.04 Changed

We won’t go deep into the details of the message sequence, but we will focus on the cost of the exchange in terms of bandwidth: if a LwM2M Client sends a registration update message every 2 minutes, and the LwM2M Server acknowledges the message, each exchange in this example costs 13 bytes sent by the LwM2M Client and 5 bytes sent by the LwM2M Server, for a total of 18 bytes. The first exchange is intentionally omitted, since it can be considered a sort of “upfront cost” that does not contribute to the maintenance of the connection in the long run.
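To give an order of magnitude, the sketch below multiplies the 18-byte exchange by the number of 2-minute periods in a day and in a month. The constant and function names are hypothetical, the monthly figure assumes a 30-day month, and only the message sizes quoted above are counted.

```python
UPDATE_BYTES = 13  # registration update sent by the LwM2M Client
ACK_BYTES = 5      # acknowledgement sent by the LwM2M Server


def keepalive_bytes(period_s: int, interval_s: int = 120) -> int:
    """Bytes exchanged over `period_s` with one update + ack every `interval_s`."""
    exchanges = period_s // interval_s
    return exchanges * (UPDATE_BYTES + ACK_BYTES)


DAY = 24 * 3600
bytes_per_day = keepalive_bytes(DAY)         # 720 exchanges of 18 bytes
bytes_per_month = keepalive_bytes(30 * DAY)  # assumption: a 30-day month
print(bytes_per_day, bytes_per_month)  # 12960 388800
```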

But what happens if the LwM2M Client does not send a registration update before the mapping on the NAT table expires? If the lifetime is set, for example, to 1 hour, the LwM2M Client does not send a registration update every 2 minutes. The only real impact in this scenario is that the LwM2M Server will not be able to send any messages to the LwM2M Client after the expiration of the NAT table mapping. The LwM2M Client will first need to send a registration message to the LwM2M Server to let the server associate the correct UDP peer with its LwM2M registration context. From there, the server will have a time-frame of 2 minutes to send messages. Apart from that, the LwM2M Client will not notice much of an impact: the only requirement to resume communication with the server is to send a registration message after the expiration of the NAT table mapping.

UDP + DTLS

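Datagram Transport Layer Security (DTLS) is a protocol that runs on top of UDP. It provides the same security guarantees as TLS. The protocol is defined in [RFC 6347].

When a peer wants to establish an encrypted connection with another peer, they need to perform a handshake to negotiate the security parameters. The exact messages depend on the security parameters negotiated. With the PSK cipher suite used in the examples of this article, the handshake consists of: Client Hello, Hello Verify Request (optional), Client Hello with Cookie (optional), Server Hello, Server Hello Done, Client Key Exchange, and then a Change Cipher Spec followed by an Encrypted Handshake Message from each side.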

Once a DTLS session is established, the peers can send encrypted messages to each other. A session is closed when one of the peers sends a close notify message.

The DTLS server associates a DTLS client through its IP address and port. It maintains a DTLS session per DTLS client, meaning the client needs to always use the same IP address and port to communicate with the server. If the IP address or port changes, the DTLS server will not be able to associate the DTLS client with its DTLS session. When the server receives a message from an unknown DTLS client, it works on the assumption that this message is a client “hello” message. If this is not the case, the server will either drop the message silently, or emit an alert message before closing the session.
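The following toy model illustrates why a changed port breaks the association. It is a sketch only: the `DtlsServer` class, its methods, and the returned strings are hypothetical, and no real DTLS processing takes place.

```python
class DtlsServer:
    """Toy model: sessions are keyed by the client's (ip, port) pair only."""

    def __init__(self):
        self.sessions = {}  # (ip, port) -> session state

    def handshake_done(self, peer, session):
        # Remember the session under the address seen during the handshake.
        self.sessions[peer] = session

    def on_record(self, peer, is_client_hello):
        if peer in self.sessions:
            return "decrypt"  # known peer: process the record normally
        # Unknown peer: the record is only acceptable if it starts a handshake.
        return "start_handshake" if is_client_hello else "drop"
```

After a NAT rebinding, the same client shows up with a different source port, so its encrypted records no longer match any session and are dropped.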

One of the mechanisms to resume a security session is called session resumption, which allows a client and server to establish a new secure session by reusing the security parameters from a previous session. This can significantly reduce the overhead of establishing a new session, as it avoids the need for a full handshake. Instead, the client sends a session ID to the server, and if the server recognizes the session ID and is willing to reuse the session parameters, it resumes the session with an abbreviated handshake.

Compared to the previous handshake, the client does not send a client key exchange message, and the server does not send a server “hello done” message.

It is recommended that the DTLS Client be the one initiating the session resumption mechanism after a long period of inactivity (meaning no communication between the peers). Such mechanisms can’t be triggered by the DTLS Server. The latter can only accept or reject the incoming session resumption requests. It’s only then that the DTLS Client can re-establish the session.

Finally, this mechanism must be used by the DTLS Client to inform the DTLS Server that its IP address and/or port has changed. It’s the only way for the server to associate the correct IP address and port pair with the DTLS session.

If we reuse the previous example from Raw UDP, not much changes for the LwM2M Client. Instead of just sending a registration update message, the client first needs to initiate a session resumption before sending its registration update message. If no messages are sent and the NAT table mapping is cleaned, the server will not be able to send outbound messages to the client, with the same consequences as before. The only difference is that instead of “only” sending 18 bytes (i.e. the registration update message and the acknowledgement of the message), the Client will need to send:

- Client Hello with Session ID (132 bytes)
- Client Hello with Session ID + Cookie (164 bytes): optional step
- Change Cipher Spec (14 bytes)
- Encrypted Handshake Message (53 bytes)
- Registration Update Message (13 bytes without encryption overhead)

=> Total of 363 bytes for the handshake messages, plus the 13-byte Registration Update Message

And the Server:

- Hello Verify Request (60 bytes): optional step
- Server Hello (111 bytes)
- Change Cipher Spec (14 bytes)
- Encrypted Handshake Message (53 bytes)
- Registration Update Message Acknowledgement (5 bytes without encryption overhead)

=> Total of 238 bytes for the handshake messages, plus the 5-byte acknowledgement

The total is 619 bytes (363 + 238 bytes of handshake messages, plus the 18 bytes of the registration update and its acknowledgement), which is approximately 34 times more than the previous example. Please note that, to simplify the example, we don’t count the encryption overhead of the messages exchanged by both applications. In the end, there will be even more bytes sent over the network. This is the cost of maintaining a secure connection between entities over time. The overhead is huge and can be a major problem for constrained networks.
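The totals above can be checked with a few lines of arithmetic. The sizes are the ones listed in this section, with the optional cookie round trip included; the variable names are only illustrative.

```python
# Handshake messages sent by each side during a session resumption (bytes).
client_handshake = {
    "Client Hello with Session ID": 132,
    "Client Hello with Session ID + Cookie": 164,  # optional step
    "Change Cipher Spec": 14,
    "Encrypted Handshake Message": 53,
}
server_handshake = {
    "Hello Verify Request": 60,  # optional step
    "Server Hello": 111,
    "Change Cipher Spec": 14,
    "Encrypted Handshake Message": 53,
}
# Registration update (13 bytes) and its acknowledgement (5 bytes).
update_exchange = 13 + 5

client_total = sum(client_handshake.values())
server_total = sum(server_handshake.values())
total = client_total + server_total + update_exchange
ratio = total / update_exchange

print(client_total, server_total, total, round(ratio, 1))  # 363 238 619 34.4
```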

Note: The example above is based on a Pre-Shared Key (PSK) cipher suite: TLS_PSK_WITH_AES_128_CCM_8.

UDP + DTLS + Connection Identifier

What can the DTLS protocol offer to reduce the overhead of the session resumption mechanism? The answer is the Connection Identifier (CID), which is the topic of this section.

The Connection Identifier was proposed in [RFC 9146] as an extension to the DTLS 1.2 protocol before being fully integrated into the DTLS 1.3 protocol. The Connection Identifier is a short opaque value used to identify a DTLS session (RFC 9146 allows up to 255 bytes; some implementations cap it lower, for example at 32 bytes). The value is generated by the DTLS Server, and is sent to the DTLS Client during the handshake. The DTLS Server will then use this value to look up the DTLS session. The Connection Identifier is sent in the clear, not encrypted, and is part of the DTLS record layer. Basically, the DTLS Client needs to attach this value to each encrypted message sent.

The extension is negotiated during the handshake and is first advertised by the DTLS Client in the Client Hello message. If the DTLS Server supports the extension, it will reply with the same extension in the Server Hello message, along with the Connection Identifier value. If one of the parties does not support the extension, the DTLS session will be established without the Connection Identifier.

The Connection Identifier eliminates the need for session resumption mechanisms since it allows the DTLS session to be decoupled from the IP address and port. The DTLS Client can change its IP address and port, and the DTLS Server will still be able to associate the DTLS session with the DTLS Client. It reduces the overhead of the costly session resumption mechanism.

To state it briefly, the DTLS Client needs to simply attach this identifier to each encrypted message so long as the DTLS session is not interrupted by a Close Notify message.
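The difference with address-based lookup can be sketched as follows. This is a toy model with hypothetical names (`CidDtlsServer`, `on_record`); a real implementation would update the peer address only after the record has been successfully decrypted and authenticated.

```python
class CidDtlsServer:
    """Toy model: sessions are keyed by Connection Identifier, not by address."""

    def __init__(self):
        self.sessions = {}  # cid (bytes) -> {"peer": (ip, port), "keys": ...}

    def handshake_done(self, cid: bytes, peer, keys):
        self.sessions[cid] = {"peer": peer, "keys": keys}

    def on_record(self, cid: bytes, peer):
        session = self.sessions.get(cid)
        if session is None:
            return None  # unknown CID: nothing to associate the record with
        # Simplification: follow the client to its new address right away.
        # Real stacks do this only once the record has been authenticated.
        session["peer"] = peer
        return session
```

A record arriving from a new port still finds its session because the lookup key is the Connection Identifier carried in the record itself.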

Let’s return to our first example to see what the impact of a session resumption would be in this case. We can now mimic the same behavior as Raw UDP, but without the cost of a session resumption mechanism. Of course, we still have the overhead of DTLS, but it could be considered acceptable since we need to encrypt the messages exchanged. Sending messages without any encryption on the wire is not acceptable for the major IoT use cases due to cybersecurity considerations.

The only downside in this scenario is the addition of the Connection Identifier to each message sent by the DTLS Client to the DTLS Server. Compared to “regular” DTLS, the encrypted messages are a bit bigger, but it’s not a big deal if we take a common IoT use case of an IoT Device that is sleeping most of the time and wakes up periodically to send data to the server. Comparing between “regular” DTLS and DTLS with a Connection Identifier, we eliminate a lot of messages exchanged and bytes sent on the wire.

Please note that in this scenario, only the DTLS Client needs to attach the Connection Identifier to the messages. The DTLS Server does not need to do it, and thus, we do not have any additional overhead on the server side.

UDP + OSCORE

For a programmer, a good way to sum up the challenge of connecting IoT devices to the cloud is that we are trying to bridge two opposed worlds: the constrained, intermittently connected world of IoT and the limitless, always-connected, IP-centric world of the cloud.

Previously, we outlined different technical approaches to the connection challenges, but could we do better than DTLS + CID? The answer may be “yes.” A new security mode has been proposed for the CoAP protocol, called Object Security for Constrained RESTful Environments (or “OSCORE”). It is defined in [RFC 8613]. OSCORE builds on CBOR Object Signing and Encryption (COSE), which is defined in [RFC 8152].

But before going into further details, it is worth mentioning that this solution is specific to the CoAP protocol, and can’t be used for other protocols. This is not a replacement of DTLS, but a new security mode that can be used on top of DTLS if necessary. Of course, it can also be used without DTLS. That’s why here we’ll demonstrate the usage of OSCORE with UDP only.

We won’t go into the details of the protocol, as these details will be part of another article. But one of the main advantages of the protocol is that it enables partial encryption for messages sent by the Client to the Server. In turn, this reduces the overall overhead of the encryption and helps to save bandwidth.

Let’s now discuss how the protocol can help us to solve the NAT table timeout problem. The OSCORE protocol can be seen as a mixture between “regular” DTLS and the Connection Identifier extensions. Each security session is identified through a Key ID Context. This key ID is used by the Server to associate its internal security context with the Client. This mechanism can be seen as the Connection Identifier extension.

Compared with DTLS, however, the OSCORE protocol does not need to perform a handshake similar to the DTLS handshake. A handshake does exist, but it’s much shorter. And if we take into consideration the fact that the messages exchanged between the Client and the Server are only partially encrypted, the overhead is much lower than with the DTLS handshake.

To take a concrete example and compare it with the UDP + DTLS + Connection Identifier solution, let’s take the same example as before.

To initiate a secure connection using OSCORE, the LwM2M Client needs to send its registration message during the negotiation of the OSCORE context with the LwM2M Server. The LwM2M Server will then reply with both the registration acknowledgement message and the end of the negotiation of the OSCORE context. It helps to remove the usual handshake messages since they are directly part of the first application messages exchanged between the Client and the Server. In addition to this scenario, the negotiation of the OSCORE context can be derived multiple times through the Appendix B.2 of the RFC 8613.

In total, to establish a secure connection:

- DTLS (with ciphersuite: TLS_PSK_WITH_AES_128_CCM_8) + Connection Identifier (with a length of 9 bytes): 988 bytes
  - DTLS handshake: 746 bytes
    - Client Hello (137 bytes)
    - Hello Verify Request (60 bytes): optional step
    - Client Hello + Cookie (169 bytes): optional step
    - Server Hello (124 bytes)
    - Server Hello Done (25 bytes)
    - Client Key Exchange (41 bytes)
    - Change Cipher Spec (14 bytes)
    - Encrypted Handshake Message (69 bytes)
    - Change Cipher Spec (14 bytes)
    - Encrypted Handshake Message (93 bytes)
  - LwM2M registration message: 242 bytes
    - Registration (165 bytes)
    - Acknowledgement (77 bytes)
- OSCORE: 359 bytes
  - OSCORE context negotiation with LwM2M registration message: 359 bytes
    - Registration (147 bytes)
    - Response with Appendix B.2 (28 bytes)
    - Registration with Appendix B.2 (157 bytes)
    - Acknowledgement (27 bytes)

Then, to maintain the connection (by looking at only the Registration Update message with its acknowledgement):

- DTLS + Connection Identifier: 130 bytes
  - LwM2M registration update message: 130 bytes
    - Registration Update (53 bytes)
    - Acknowledgement (77 bytes)
- OSCORE: 65 bytes
  - LwM2M registration update message: 65 bytes
    - Registration Update (48 bytes)
    - Acknowledgement (17 bytes)
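To compare the two approaches over time, the figures above can be combined into a setup cost plus a per-update cost. The sketch below does this for an arbitrary number of registration updates; the 24 updates per day (a lifetime of about one hour) is an assumption made for illustration, as are the names used.

```python
# Byte counts taken from the two lists above.
DTLS_CID = {"setup": 746 + 242, "update": 53 + 77}
OSCORE = {"setup": 147 + 28 + 157 + 27, "update": 48 + 17}


def total_bytes(cost: dict, updates: int) -> int:
    """Bytes for one session setup followed by `updates` registration updates."""
    return cost["setup"] + updates * cost["update"]


updates_per_day = 24  # assumption: one registration update per hour
dtls_day = total_bytes(DTLS_CID, updates_per_day)
oscore_day = total_bytes(OSCORE, updates_per_day)
print(dtls_day, oscore_day)  # 4108 1919
```

Under these assumptions, OSCORE needs less than half the bytes of DTLS with a Connection Identifier, both for the initial setup and for each update.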

Conclusion

As highlighted above, the problem of NAT table timeout is not a simple problem to solve. But fortunately, there are good solutions to overcome this problem. It depends on your use case and the constraints that you have. If you can’t use OSCORE security, you may be interested in using DTLS with Connection ID. But if one of the DTLS peers does not support a Connection ID, you would then use the usual mechanisms that are costly in terms of bandwidth.

The best solution would of course be to rely on the most recent solution (OSCORE security), but due to the lack of support for this security mode, this solution has not been widely adopted yet. More and more devices are using DTLS with a Connection ID. Unfortunately, some issues came up during the early days of this solution’s adoption: a few existing implementations were written while the RFC wasn’t finished yet, which made them incompatible with each other. The situation has greatly improved since the final version of the RFC. However, it did cause a lot of confusion, and some implementations even support the old version of the RFC with all its variations.

Today, the most used solution is still the session resumption mechanism of DTLS.
