
Introducing topics and data

Diffusion™ stores and distributes data through topics.

Topics

At the heart of the Diffusion model lies the concept of a topic. This page covers the various aspects of topic management that make Diffusion unique, including persistent subscriptions using topic selectors; topic query capabilities; automatic topic removal; how Diffusion achieves high network efficiency using delta streaming, conflation, and compression; and how to secure topic data.

In Diffusion, data is stored and distributed through topics. Each topic has a topic type and a current data value which is maintained in memory on the server. A topic's type determines the data values that can be stored and published through the topic.

Granted sufficient security privileges, a client session can subscribe to a topic to receive notifications when the topic value changes and can also update the value. When a topic is updated, all its subscribers are notified of the new value. Diffusion takes care of efficiently broadcasting value changes, even if there are many thousands of subscribers.

Topics are identified by topic paths. A topic path is a string of parts separated by the / character, for example, weather/capitals/athens. Together, the set of topic paths forms the topic tree.

The topic tree allows topics to be addressed in groups using special expressions called topic selectors. For example, the topic selector ?weather/capitals/ can be used to subscribe to all topics below the topic path weather/capitals. See the full syntax of topic selector expressions.
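As a rough illustration of how a selector picks out part of the topic tree, the sketch below models only the two selector forms that appear on this page: `>path`, which selects exactly one topic path, and `?path/`, which selects every topic below a path. The function name `matches` and the extra topic paths are hypothetical, and the real selector syntax is considerably richer than this.

```python
def matches(selector: str, topic_path: str) -> bool:
    """Return True if topic_path is selected by a simplified selector.

    Only two forms are modelled (an assumption of this sketch):
      '>a/b'  selects exactly the topic path a/b
      '?a/b/' selects every topic below a/b, but not a/b itself
    """
    if selector.startswith(">"):
        return topic_path == selector[1:]
    if (selector.startswith("?") and selector.endswith("/")
            and not selector.endswith("//")):
        prefix = selector[1:]  # e.g. 'weather/capitals/'
        return topic_path.startswith(prefix) and topic_path != prefix
    raise ValueError(f"unsupported selector form: {selector!r}")


# A small, made-up topic tree expressed as a list of topic paths.
topic_tree = [
    "weather/capitals",
    "weather/capitals/athens",
    "weather/capitals/paris",
    "weather/towns/delphi",
]

# Select every topic below weather/capitals.
selected = [path for path in topic_tree if matches("?weather/capitals/", path)]
```

Here `selected` holds the two capital-city topics; the topic at `weather/capitals` itself and the topic under `weather/towns` are not selected.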

Topics are lightweight and cheap to create and destroy. There are commercial Diffusion applications that use millions of topics hosted in a single server and create tens of thousands of topics when a new tranche of data items becomes available. The low cost per topic allows for topic trees with a fine-grained mapping to logical data models, with each topic representing a discrete data item that can be updated independently.

Topic types, values, and updates

There are nine topic types that can be grouped into four categories: primitive; composite; multi-valued; and reference.

The four primitive topic types — string, int64, double, and binary — are used for topics with simple, atomic values. String topics store text, int64 and double topics store numbers, and binary topics can store arbitrary data such as a PNG image.

There are two composite topic types: JSON and recordV2. A JSON topic has a JSON value, a format that is familiar to developers and easy for JavaScript clients to process. RecordV2 topics store an array of fields, constrained by an optional schema. RecordV2 topics exist as an upgrade path for applications that were previously using the removed record topic type – new applications should use JSON topics.

Many applications can get by using only the primitive and JSON topic types. Multi-valued and reference topic types are more specialized and less commonly used.

There is a single multi-valued topic type. The time series topic stores a history of events. Events are created by a special type of update. Each event has a timestamp, records the security principal that created it, and has a value. The values for a time series topic are all of the same type, which can be string, int64, double, binary, JSON, or recordV2 – that is, the same data types used for primitive and composite topics. Ranges of events can be queried by date range or event offset.
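To make the shape of a time series concrete, here is a minimal in-memory sketch of an event history that can be queried by time range or by offset. The class and field names (`TimeSeries`, `Event`, `sequence`, `max_events`) are assumptions made for illustration and do not correspond to the Diffusion API; timestamps are supplied by the caller here, whereas a server would assign them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    sequence: int      # position of the event in the series (hypothetical name)
    timestamp: float   # supplied by the caller in this sketch
    author: str        # security principal that created the event
    value: str


class TimeSeries:
    """A bounded, append-only history of events for a single topic."""

    def __init__(self, max_events: int) -> None:
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self.max_events = max_events
        self.events: List[Event] = []
        self._next_sequence = 0

    def append(self, timestamp: float, author: str, value: str) -> Event:
        event = Event(self._next_sequence, timestamp, author, value)
        self._next_sequence += 1
        self.events.append(event)
        # Discard the oldest events once the configured history is exceeded.
        del self.events[:-self.max_events]
        return event

    def by_time(self, start: float, end: float) -> List[Event]:
        """Events whose timestamp lies in the inclusive range [start, end]."""
        return [e for e in self.events if start <= e.timestamp <= end]

    def by_offset(self, first: int, count: int) -> List[Event]:
        """Up to `count` retained events starting at sequence number `first`."""
        return [e for e in self.events if first <= e.sequence < first + count]


series = TimeSeries(max_events=3)
for second, reading in enumerate(["10", "11", "12", "13"]):
    series.append(float(second), "sensor", reading)

latest_two = [e.value for e in series.by_offset(first=2, count=2)]
```

Because the history is limited to three events, the first reading has already been discarded by the time the fourth is appended.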

There are two reference topic types: slave and routing. These are quite different from other topic types. Rather than storing a data value, they re-present the values of other source topics at their topic path. A source topic can be any primitive, composite, or multi-valued topic. A slave topic has a fixed source topic. A routing topic calls out to an application-provided routing handler to determine the source topic for each subscribing session.

When a topic is created, it has no value. A client session can update the topic by providing a value. Primitive and composite topics store the latest received value. Time series topics store a configurable history of values. Each reference topic re-presents the value or values of its source topic.

Sometimes there is no need to store the current value of a topic. Perhaps the value has a limited lifetime and is only of transient worth. A topic can be configured not to retain its last value to reduce the server memory footprint. However, this disables the delta streaming optimization (see below), so it is not often done.

For a given topic, the order of value updates is preserved from the source session to the subscriber sessions. If a session updates a string topic with the value A1 followed by A2, the server will notify subscribing sessions of the updates in that order. No guarantees are made about the order of updates across topics. For example, if a session updates topic A with the values A1 and A2, and topic B with the values B1 and B2, one subscriber might receive A1, B1, B2, A2, and another might receive B1, B2, A1, A2.
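The ordering guarantee can be stated as a small check: two update sequences are compatible if, for each topic taken separately, the values appear in the same order. The helper names below are hypothetical, and the sketch only verifies a recorded sequence; it does not model how the server delivers updates.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, str]  # (topic path, value)


def per_topic(updates: List[Update]) -> Dict[str, List[str]]:
    """Group values by topic, keeping the order in which they appear."""
    grouped: Dict[str, List[str]] = {}
    for path, value in updates:
        grouped.setdefault(path, []).append(value)
    return grouped


def preserves_topic_order(sent: List[Update], received: List[Update]) -> bool:
    """True if every topic's values arrive in the order they were sent."""
    return per_topic(sent) == per_topic(received)


sent = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]

# The two interleavings from the example above are both acceptable.
first_subscriber = [("A", "A1"), ("B", "B1"), ("B", "B2"), ("A", "A2")]
second_subscriber = [("B", "B1"), ("B", "B2"), ("A", "A1"), ("A", "A2")]

# Receiving A2 before A1 would violate the per-topic guarantee.
reordered = [("A", "A2"), ("A", "A1"), ("B", "B1"), ("B", "B2")]
```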

Adding and removing topics

Sessions with appropriate security privileges can add and remove topics. The topic path and topic specification are supplied when adding a topic. The topic specification consists of the topic type and a set of topic properties that allow the behavior of the topic to be adapted to application needs. Some topic properties are specific to the topic type. For example, the TIME_SERIES_EVENT_VALUE_TYPE property configures the data type of the values for a time series topic, and the SCHEMA property configures an optional schema for recordV2 topics.

All topic types support the optional REMOVAL property, which configures an automatic removal policy. Each policy provides a set of conditions under which the server will remove a topic. You can configure a topic to be removed at a future time, if it has stopped receiving updates, if it has no subscribers, or when the server has no client sessions matching specific criteria. The criteria are expressed in terms of session property values.

Subscribing to topics

The server maintains a real-time data model, presented through topics. Each client selectively subscribes to a subset of the data model, according to the needs of the application and data security restrictions applied at the server. Topics provide a fine-grained mapping of the logical data model, so in a typical application each client has a unique partial view of the data model. The client library retains the values of each subscribed topic. The server sends updates to keep the client's view synchronized.

Client sessions subscribe to topic paths using topic selectors. The server persists the set of topic selectors for each session, and dynamically joins selectors against its topics to resolve subscriptions. The dynamic join between topic selectors and topics is unique to Diffusion and is a powerful way to link client applications with a changing data model. The set of topic selectors defines the view of the data model the client is interested in. The server keeps each session up-to-date with the available data that matches the provided topic selectors. Let's look at how this works.

When a session subscribes with a topic selector, the server will resolve subscriptions to all topics with paths matching the topic selector. The session will be notified of the resolved subscriptions and the current value of each topic that has one. The subscription notification includes the topic specification. The server will further notify the session of topic value changes as they occur. A session can subscribe to a topic path for which there is no topic. If a topic is created for the path at a later time, the server will resolve the subscription and notify the session. For example, if a topic weather/capitals/paris is added, subscriptions will be resolved for all sessions that have previously subscribed using the topic selector ?weather/capitals/. The server will notify the subscribing sessions of the new subscriptions.

If the topic is removed, any resolved subscriptions will be removed, and the previously subscribed sessions will be notified of the unsubscription.

A session can unsubscribe from paths using a topic selector. Subscriptions will be removed for any topics matching the selector to which the session was previously subscribed, and the server will notify the session of the unsubscription. Like subscribe requests, unsubscribe requests are persisted by the server. The session's selector set is the accumulation of the subscribe and unsubscribe events, in the order received. For example, if a session subscribes to ?weather/capitals/ and unsubscribes from >weather/capitals/athens, the selector set will match all topics below weather/capitals except for weather/capitals/athens. On the other hand, if a session first unsubscribes from >weather/capitals/athens and then subscribes to ?weather/capitals/, the subscription will mask the more specific, earlier unsubscription and the selector set will match all topics below weather/capitals.
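The way a selector set accumulates, and how it is joined against whatever topics currently exist, can be sketched as follows. This is a simplified model rather than the server's implementation: the `Session` class and the `resolve` and `matches` functions are hypothetical names, only the `>path` and `?path/` selector forms are handled, and it assumes that the most recent matching subscribe or unsubscribe event decides whether a topic is wanted.

```python
from typing import List, Set, Tuple


def matches(selector: str, path: str) -> bool:
    """Simplified matching: '>a/b' is exactly a/b; '?a/b/' is everything below a/b."""
    if selector.startswith(">"):
        return path == selector[1:]
    if selector.startswith("?") and selector.endswith("/"):
        prefix = selector[1:]
        return path.startswith(prefix) and path != prefix
    raise ValueError(f"unsupported selector form: {selector!r}")


class Session:
    """Holds the ordered subscribe/unsubscribe history for one session."""

    def __init__(self) -> None:
        self.selector_events: List[Tuple[bool, str]] = []  # (is_subscribe, selector)

    def subscribe(self, selector: str) -> None:
        self.selector_events.append((True, selector))

    def unsubscribe(self, selector: str) -> None:
        self.selector_events.append((False, selector))

    def wants(self, path: str) -> bool:
        # The most recent event whose selector matches the path decides.
        wanted = False
        for is_subscribe, selector in self.selector_events:
            if matches(selector, path):
                wanted = is_subscribe
        return wanted


def resolve(session: Session, topics: Set[str]) -> Set[str]:
    """Join the session's selector set against the topics that currently exist."""
    return {path for path in topics if session.wants(path)}


topics = {"weather/capitals/athens", "weather/capitals/rome"}

# Subscribe broadly, then unsubscribe from one path: athens is excluded.
a = Session()
a.subscribe("?weather/capitals/")
a.unsubscribe(">weather/capitals/athens")
before = resolve(a, topics)

# Reverse order: the later, broader subscription masks the unsubscription.
b = Session()
b.unsubscribe(">weather/capitals/athens")
b.subscribe("?weather/capitals/")
masked = resolve(b, topics)

# A topic added later is picked up without any further request from the session.
topics.add("weather/capitals/paris")
after = resolve(a, topics)
```

Re-running `resolve` after the topic set changes stands in for the server's dynamic join: session `a` gains a subscription to `weather/capitals/paris` although it never asked for that path by name.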

The dynamic joins extend to slave and routing reference topics. Subscriptions to reference topics are only resolved if the referenced source topic also exists. Subscriptions to reference topics will be removed if either the reference topic or the source topic is removed.

Fetching topic data

A session can fetch the topic specifications and current values of a set of topics. This is a one-off operation that captures a snapshot of the data – the session is not notified of later value updates – but is useful for applications needing to present a static view of the available data.

The set of topics to fetch is specified with a topic selector and can be further constrained to allow the topic tree to be explored page-by-page.
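One way to picture page-by-page exploration is a fetch that returns a limited number of topics in path order and can resume after the last path already seen. The sketch below works on a plain dictionary; `fetch_page`, `fetch_all`, and their parameters are hypothetical and are not the Diffusion fetch API.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Page = List[Tuple[str, str]]  # (topic path, current value) pairs


def fetch_page(topics: Dict[str, str], branch: str,
               after: Optional[str] = None, limit: int = 2) -> Page:
    """Return up to `limit` topics below `branch`, in path order,
    starting strictly after the path `after` (or from the start if None)."""
    selected = sorted(p for p in topics if p.startswith(branch + "/"))
    if after is not None:
        selected = [p for p in selected if p > after]
    return [(p, topics[p]) for p in selected[:limit]]


def fetch_all(topics: Dict[str, str], branch: str, limit: int = 2) -> Iterator[Page]:
    """Walk the branch page by page, resuming after the last path seen."""
    after = None
    while True:
        page = fetch_page(topics, branch, after, limit)
        if not page:
            return
        yield page
        after = page[-1][0]


# Made-up snapshot of topic values.
snapshot = {
    "weather/capitals/athens": "21C",
    "weather/capitals/paris": "17C",
    "weather/capitals/rome": "23C",
    "news/top": "headline",
}

pages = list(fetch_all(snapshot, "weather/capitals", limit=2))
```

Each page is a snapshot taken at the time of the call; nothing in this sketch notifies the caller of later changes, which mirrors the one-off nature of a fetch.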

How Diffusion makes efficient use of the network

Many aspects of Diffusion across different architectural layers combine to allow very efficient delivery of application data over the network. The performance translates directly into tangible financial savings for Diffusion users and their customers – more application data can be streamed using less network bandwidth. In addition, applications can provide richer and more data-intensive views.

Diffusion uses a proprietary binary network protocol, designed with close attention to minimizing transport framing costs. For each session, the server balances the batching of updates into network operations against their timely delivery.

The fine-grained mapping of topics to the logical data model allows an application client to select only the data items that it needs. The server maintains the topic selectors for each session, so it can immediately subscribe sessions to new data items without additional interactions. In contrast, publish-and-subscribe messaging systems often require applications to publish the availability of a new data item on one channel, and interested clients to respond to this event by subscribing individually, which is expensive to process and introduces unnecessary delays.

Through the subscription-based approach, each client session is synchronized with the topics it is subscribed to. Consequently, the server only needs to inform each client of a topic's path and specification when the subscription is resolved. Even better, it allows changes to a topic's value to be sent as an optimal delta stream.

A delta stream encodes a change to the value by sending only the differences between the old value and the new value. Updates to values frequently only affect part of the value. Consider a JSON value – typically the structure of the value including object keys, white space, and delimiters is unchanged between successive updates. Delta streaming is performed automatically and is transparent to the application. The server calculates the differences between the previous value and the new value and sends this to the client. The client applies the differences to its copy of the previous value to calculate the new value. Delta streams are also used when a client session uses an update stream to send a sequence of updates to a topic. Again, Diffusion automatically and transparently calculates and sends differences between successive values. The synchronized, stateful communication used by Diffusion is much more network efficient than the stateless communication used by messaging-based or polling-based systems.
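The idea of sending only the differences can be illustrated with a deliberately simple delta format: keep the bytes the old and new values share at the start and at the end, and transmit only the middle. This is not Diffusion's actual delta algorithm, which is not described here; `make_delta`, `apply_delta`, and the tuple layout are assumptions for this sketch.

```python
from typing import Tuple

Delta = Tuple[int, int, bytes]  # (prefix length, suffix length, replacement bytes)


def make_delta(old: bytes, new: bytes) -> Delta:
    """Describe `new` as an edit of `old`: keep a prefix and a suffix, replace the middle."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # The suffix must not overlap the prefix in either value.
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old: bytes, delta: Delta) -> bytes:
    """Rebuild the new value from the receiver's copy of the old value."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


old_value = b'{"city":"athens","temp":21}'
new_value = b'{"city":"athens","temp":23}'

delta = make_delta(old_value, new_value)       # computed by the sender
rebuilt = apply_delta(old_value, delta)        # computed by the receiver
```

For these two JSON values, the replacement part of the delta is a single byte, while the full value is 27 bytes. Both sides must hold the same previous value for this to work, which is why the approach depends on stateful, synchronized communication.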

Topic value updates sent from the server to sessions are compressed and decompressed by the clients. The server compresses each update once and re-uses the result for all of the subscribers. Compression is complementary to delta streaming and provides additional efficiency benefits.

Diffusion's conflation feature improves the efficiency, reliability, and timeliness of topic updates sent to slow or temporarily disconnected sessions. The server has a queue of updates for each session. Updates can back up on a queue if the session is temporarily disconnected, there is a network bottleneck, or the client application is performing slowly. Conflation addresses the backlog by selectively removing out-of-date topic updates. This reduces server memory footprint and the amount of network data required to bring a session back up to date. A conflation policy can be tuned for each topic using the CONFLATION topic property.
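A minimal sketch of one possible conflation behaviour follows: when a new update arrives for a topic that already has an update waiting in the session's queue, the stale update is dropped and only the latest value is kept. The `SessionQueue` class is hypothetical, and the policy shown (always keep only the latest value, queued at the back) is an assumption; the policies actually available through the CONFLATION property are not modelled.

```python
from typing import Dict, List, Tuple


class SessionQueue:
    """Pending updates for one session, conflated to the latest value per topic."""

    def __init__(self) -> None:
        # Python dicts preserve insertion order, so the oldest entry comes first.
        self._pending: Dict[str, str] = {}

    def enqueue(self, path: str, value: str) -> None:
        # Remove any stale update for this topic, then queue the new value at the back.
        self._pending.pop(path, None)
        self._pending[path] = value

    def drain(self) -> List[Tuple[str, str]]:
        """Return the queued updates in delivery order and empty the queue."""
        updates = list(self._pending.items())
        self._pending.clear()
        return updates


queue = SessionQueue()
queue.enqueue("fx/eurusd", "1.0841")
queue.enqueue("fx/gbpusd", "1.2710")
queue.enqueue("fx/eurusd", "1.0843")   # supersedes the queued 1.0841

delivered = queue.drain()
```

Three updates were queued but only two are delivered, and the session still ends up with the current value of each topic.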

Controlling access to topic data

Using Diffusion's role-based authorization system, individual sessions can be granted or denied the rights to add and remove a topic, to subscribe using a topic selector, to view a topic value, or to update a topic value.

Each session has a set of roles obtained through the authentication process or set by control sessions. Each role grants a session various security permissions. Access to topics is controlled via the topic permissions MODIFY_TOPIC, READ_TOPIC, SELECT_TOPIC, and UPDATE_TOPIC. Time series topics can be further controlled by the topic permissions QUERY_OBSOLETE_TIME_SERIES_EVENTS, EDIT_TIME_SERIES_EVENTS, and EDIT_OWN_TIME_SERIES_EVENTS, which grant sessions additional control over the history of time series events.

Topic permissions are assigned to roles for a particular branch of the topic tree. An assignment applies to all topics with paths belonging to the branch unless overridden by a more specific assignment.
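The branch-based lookup can be sketched as a search for the most specific assignment that contains a topic path. The data layout and the function `permissions_for` are hypothetical, and the sketch assumes that a more specific assignment replaces the inherited permissions entirely rather than adding to them.

```python
from typing import Dict, FrozenSet

# Hypothetical configuration for one role: branch of the topic tree -> permissions.
Assignments = Dict[str, FrozenSet[str]]


def permissions_for(assignments: Assignments, topic_path: str) -> FrozenSet[str]:
    """Return the permissions from the most specific branch containing topic_path."""
    parts = topic_path.split("/")
    # Try the full path first, then each shorter ancestor branch.
    for length in range(len(parts), 0, -1):
        branch = "/".join(parts[:length])
        if branch in assignments:
            return assignments[branch]
    # '' stands for the root of the topic tree in this sketch.
    return assignments.get("", frozenset())


role = {
    "weather": frozenset({"READ_TOPIC", "SELECT_TOPIC"}),
    "weather/capitals/athens": frozenset({"READ_TOPIC", "UPDATE_TOPIC"}),
}

paris = permissions_for(role, "weather/capitals/paris")    # inherited from 'weather'
athens = permissions_for(role, "weather/capitals/athens")  # overridden
news = permissions_for(role, "news/top")                   # no assignment applies
```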

The MODIFY_TOPIC permission is required to add or remove a topic. The UPDATE_TOPIC permission is required to update a topic value.

The READ_TOPIC permission is required to subscribe to or fetch a topic. If a session does not have READ_TOPIC permission for a topic, the topic will be excluded from the results of subscription or fetch operations for the session. READ_TOPIC permissions are one factor in the server's dynamic join of topic selectors to available topics. If a session's roles change – for example, perhaps a control session applies the *change roles* operation to the session – the server will reevaluate its topic selectors. The session will be subscribed to matching topics for which it now has permission and unsubscribed from the topics for which it no longer has permission.

The SELECT_TOPIC permission is required to use a topic selector, so controls the parts of the topic tree from which a session can subscribe or fetch. Given the READ_TOPIC permission controls access to topic paths, why is this useful? The answer is that some applications delegate subscription to a control session. A session that has READ_TOPIC permission but not SELECT_TOPIC permission for a particular topic path cannot subscribe directly to topics belonging to the path. However, the session can be independently subscribed by a control session that has the MODIFY_SESSION global permission in addition to the appropriate SELECT_TOPIC permission.

Sometimes a topic is used to publish information to a single user, for a user to broadcast information, or to share data between a user's multiple sessions. In these cases, it can be unwieldy to set up lots of specialized topic permissions for the different security principals representing the users. An alternative is to create the topic as owned by a particular principal, using the OWNER topic property. A topic with the OWNER property grants full access to sessions authenticated with the named principal. Other sessions continue to be constrained by the configured topic permissions.

Premium features: persistence, replication, and fan-out

Three topic-related features are included in the separately licensed Scale and Availability pack.

Topic persistence logs a server's topic data to disk. Topic persistence allows a server to be stopped and restarted without needing to start a separate client to re-create topics and their values. It can provide faster time-to-recovery and is very useful during development when servers are frequently restarted or test data needs to be shared between developers and environments.

Topic replication mirrors the topic tree across a cluster of peer servers. This improves system availability – the topic data can survive the loss of individual servers – and provides a consistent view of the data to each client session regardless of the server that hosts the session.

Fan-out is designed for replication of topic data between different geographies. Fan-out links can be configured to mirror selected parts of the topic tree from a primary server or cluster of primary servers to one or more secondary servers. The secondary servers present a read-only view of the topic data; updates can only be made through the primary server. Some Diffusion systems use fan-out within a data center, to separate a primary data tier of servers from a secondary tier of servers that host customer sessions. This design allows the secondary tier to be scaled independently to support millions of sessions.