#work_in_progress #blog_post

A couple of weeks ago a friend of mine working at Grafana showed me a video of a hackathon project. The project was called something along the lines of agentic research or deep researcher, and it implemented research on user actions using an LLM. While I thought that was very cool™, I could not avoid thinking about the limitations of using an LLM to analyze a set of actions, since the model is incapable of learning userbase-level statistics. That is to say, the model cannot know what "regular behavior" is because it can only interact with one set of actions at a time, without memory of other users. This is likely not a big limitation for a hackathon project. However, developing tools that can capture the complexities of user actions is a critical issue for many applications. This is especially important when identifying behavioral patterns for regression, recommendations or anomaly detection. To address this, deep learning architectures have been used to create high-quality representations of user behavior by leveraging techniques from language modeling.

In this article I want to cover some of these techniques, because I think they are cool™ but also because I believe this will be the future for monitoring any task that can be represented as a sequence. In particular I will be focusing on unsupervised representation learning, since it's the most widely applicable approach: it doesn't require labeling considerable amounts of data.

# What is a foundational model?

The name foundational model comes from the paper [On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258) from Stanford University. A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. Some examples are BERT, GPT-2, CLIP or, more recently, the open-source Chinese models like Qwen. When this paper came out everybody hated it ([proof](https://www.youtube.com/watch?v=tunf2OunOKg&t=715s)), mainly because it was citation farming but also because the term "foundational model" was too broad and lacked specifics. However, as time went on, we eventually settled on using the term to refer to any model that can be used as a starting point to fine-tune for a specific application.

# What is a representation?

A representation (or embedding) in machine learning is a way of encoding data into a numerical format that captures the essential features and relationships within that data. An encoder model transforms raw data (audio, text, images, video) into a list of numbers (generally called a vector) representing the raw data. Think of it as translating information into a mathematical language that algorithms can understand and work with effectively.

One of my favorite examples of an application for encoders is facial recognition. Facial recognition works by training a model (don't worry about how) to receive images of faces as inputs and output a list of numbers. During training, the model learns that photos of the same person should get similar numbers, while photos of different people should get very different numbers. So if you have two different photos of someone, they'll both get very similar lists of numbers, because the face has the same basic features in both pictures.
Facial recognition works by taking the embedding of a picture of an identified person and comparing it to the embedding of a picture of an unidentified person. If the lists are similar enough, then we determine that both pictures are of the same person.

![[Pasted image 20250731094124.png]]

The key takeaway here is that embeddings are very useful for grouping and segmenting unidentified (unlabeled) data. For a more detailed explanation, I recommend the ["Embeddings are underrated"](https://technicalwriting.dev/ml/embeddings/overview.html#underrated) blog post.

Now that we know what an encoder is and what foundational models are, we can define a foundational sequence encoder model.

# Foundational Sequence Encoder for Behavioral Monitoring™

A foundational sequence encoder model is a neural network pre-trained on large-scale sequential data that learns to create rich representations of sequences, which can then be fine-tuned or adapted for multiple downstream tasks. A sequence can be many things: it could be a time series or a set of time series, a list of actions, or even a video. The embeddings can be at the instance level, where you create a representation for your entire sequence, or at the timestamp level, where you create a representation for every timestamp in the series.

![[encoders.drawio.png]]

Timestamp-level representations tend to be better for regression and anomaly detection, while instance-level representations are useful for classification and clustering. For those more familiar with language models, instance-level encoders are the equivalent of BERT models, where the output for the entire text input is a single embedding. Timestamp-level encoders are akin to SpanBERT models, where the output is an embedding for every input token.

# Encoding time series

## Learning to embed time series

Let's start by identifying what type of training we would have to do in order to learn representations for time series. Looking at the [Unsupervised Representation Learning for Time Series: A Review](https://arxiv.org/abs/2312.04142) literature review we can start to see a pattern: most recent papers (up to the review's publication in August 2023) focus on contrastive learning. This is currently the dominant approach for time series representations, with [TimeDRL](https://arxiv.org/abs/2312.04142) (a contrastive learning approach) being the state of the art. For the purpose of this article I will focus on [TS2vec](https://arxiv.org/abs/2106.10466) and [TimeDRL](https://arxiv.org/abs/2312.04142), as I find them to be very instructive for explaining the modern approach to representation learning on time series. (I will also likely write an in-depth article going over some of their behaviors.) Before moving on, I would like to point out another cool trend, which is the increase in papers treating videos as time series. Very cool!

## Short intro to contrastive learning

Contrastive learning is a machine learning paradigm where unlabeled data points are juxtaposed against each other to teach a model which points are similar and which are different. Going back to our facial recognition example, most facial recognition models are trained using contrastive learning.
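As a quick illustration of the verification step described above, here is a minimal sketch of comparing two face embeddings. The embedding values and the similarity threshold are made up for illustration; a real system would use the vectors produced by a trained encoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embedding vectors; 1.0 means they point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for the embeddings an encoder would produce for two photos.
emb_known = np.array([0.12, 0.98, -0.33, 0.45])    # photo of an identified person
emb_unknown = np.array([0.10, 0.95, -0.30, 0.47])  # photo of an unidentified person

# The threshold is illustrative; in practice it is tuned on a validation set.
if cosine_similarity(emb_known, emb_unknown) > 0.8:
    print("Same person")
else:
    print("Different person")
```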
Facial recognition models are generally trained with the optimization objective of creating similar representations when evaluating pictures of the same person (referred to as a positive pair) and very different representations for pictures of different people (referred to as a negative pair). To improve this learning further, there is a focus on creating "hard training batches" by selecting similar looking people or by taking pictures of the same person from different angles or lighting. This considerably improves performance and is the basis for papers like [Contextual Document Embeddings](https://arxiv.org/abs/2410.02525). For more details on how facial recognition works I recommend the blog post [Triplet Loss: Intro, Implementation, Use Cases](https://www.v7labs.com/blog/triplet-loss).

## TS2vec

TS2vec was the state of the art model until March of 2023, when [SimTS](https://arxiv.org/abs/2303.18205) came out; at the time it represented a considerable breakthrough, surpassing all comparable models. The model is primarily timestamp-based, but the authors provide a methodology to derive instance-level embeddings through max pooling. The model's remarkable performance can be attributed to 2 main factors: its training and its pair construction.

### Pair construction for time series representation

For learning time series representations, we generally create positive and negative pairs to contrast. For negative pair construction, you generally just select 2 different time series. You might also implement some **hard negative mining**, where you select time series that are similar but different in some key way, in order to make the model learn to differentiate similar inputs.

Getting positive pairs is a little more complicated. Instead of searching for time series that are very similar, most contrastive learning approaches compare a time series to an augmented version of itself. This approach, however, can introduce biases into the model. The following are some of the most common pair construction methods; they will add context to the augmentations used by TS2vec.

#### Subseries sampling

Subseries sampling works by selecting a time series and constructing a positive pair from a subsection of it. In the following picture, green is the original time series and yellow is the subseries used as its pair.

![[Pasted image 20250806175421.png]]

The bias learned is known as **subseries consistency** [(Franceschi, Dieuleveut, and Jaggi, 2019)](https://arxiv.org/abs/1901.10738), which encourages the representation of a time series to be closer to that of its sampled subseries. This, however, can lead the model to ignore important information from the added context in its representation.

![[Pasted image 20250806214722.png]]

For the pair in the example above, the subseries in green lacks the context of the rest of the sequence. In order for the yellow area to be encoded the same as the green one, the model must ignore the surrounding context.

#### Adjacent Subseries Sampling

Adjacent subseries sampling works by selecting 2 adjacent but disjoint portions of the same time series at 2 different points in time.

![[Pasted image 20250806215853.png]]

The bias learned from this is called **temporal consistency** [(Tonekaboni, Eytan, and Goldenberg, 2021)](https://arxiv.org/abs/2106.00750), which enforces the local smoothness of representations by choosing adjacent segments as positive samples. The main drawback with this type of sampling is that you might sample 2 subseries with wildly different behaviors.
![[Pasted image 20250806215710.png]]

For the example above, it's unreasonable to expect both series to have similar encodings, so training towards this objective can result in unexpressive embeddings.

#### Transformation of the original series

Another alternative is to create a pair using a transformation of the original series. Below we see the original time series (green) and its moving average (yellow).

![[Pasted image 20250807120106.png]]

The idea here is to teach the model to create the same representation for the original time series and the transformed time series, making the model transformation invariant: both the original series and the transformed series would have similar embeddings. The main risk with using transformations is that the model might confuse different time series if they share a particular property. This may or may not be a good thing. For example, you might want to train an instance representation model that clusters time series with similar seasonal patterns regardless of their magnitude. To achieve this you could create positive pairs where you compare an original time series with the same time series scaled up or down. While this might work in specific cases, the general mindset when implementing deep learning systems is that you shouldn't make assumptions about the data; instead you should let the model learn from the data.

### TS2vec augmentations

In order to avoid the biases mentioned previously, TS2vec utilizes 2 novel augmentations: **Random overlapping cuts** and **Timestamp Masking**.

#### Random overlapping cuts

Random overlapping cuts work by sampling 2 subseries from a time series such that both subseries intersect. Since TS2vec produces timestamp embeddings, the idea here is that the timestamp embeddings for the intersection should be similar despite the 2 subseries having differing contexts.

![[Pasted image 20250808133142.png]]

#### Timestamp masking

Timestamp masking is applied in conjunction with the random overlapping cuts. It works by randomly setting half of the timestamps to zero after the first layer of the neural net. The idea here is that the embeddings should remain similar even with some information hidden from them. This also makes it so the model learns to partially reconstruct the removed information.

![[Pasted image 20250808141504.png]]

### Training TS2vec

![[ts2vec_model.png]]

First let's examine the general architecture of the model for a univariate time series. As seen above, we first create a pair of examples using **random overlapping cuts**. Notice that after our first layer we apply **timestamp masking**. Afterwards, dilated convolutions are used to create the timestamp embeddings. From here the paper introduces its main contribution to training, **hierarchical contrasting**. Hierarchical contrasting works by calculating multiple losses at different levels of resolution. This is done by repeatedly max pooling with a size 2 kernel.

![[max_pool.png]]

We repeat the process until we can't apply it anymore. For example, if we create embeddings of dimension 100 for 32 timestamps (a matrix of shape (32, 100)), we will repeat the pooling 5 times until we have an embedding of shape (1, 100). This last one becomes our instance-level representation.
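Here is a minimal sketch of that pooling hierarchy, assuming PyTorch and timestamp embeddings stored as a `(batch, timestamps, dim)` tensor; the function name and shapes are illustrative and not taken from the official TS2vec implementation.

```python
import torch
import torch.nn.functional as F

def hierarchical_levels(timestamp_repr: torch.Tensor) -> list[torch.Tensor]:
    """Repeatedly max pool timestamp embeddings (batch, timestamps, dim) with a size 2 kernel."""
    levels = [timestamp_repr]
    z = timestamp_repr
    while z.size(1) > 1:
        # max_pool1d expects (batch, channels, length), so we pool along the time axis.
        z = F.max_pool1d(z.transpose(1, 2), kernel_size=2).transpose(1, 2)
        levels.append(z)
    # The last level, of shape (batch, 1, dim), acts as the instance-level embedding.
    return levels

levels = hierarchical_levels(torch.randn(8, 32, 100))
print([lvl.shape for lvl in levels])  # 32 -> 16 -> 8 -> 4 -> 2 -> 1 timestamps
```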
By calculating the loss at different levels we intend to achieve 3 things:

- **Capture multi-scale patterns**: Different temporal granularities reveal different types of patterns
- **Learn universal representations**: The same representation can be used for tasks requiring different semantic levels (timestamp-level for forecasting, instance-level for classification)
- **Improve robustness**: Learning at multiple scales makes the model more robust to noise and variations

We will now explain how the loss is calculated; this step is repeated for every resolution level.

#### Temporal Contrastive Loss

To learn discriminative representations over time, TS2Vec takes the representations at the same timestamp from two views of the input time series as positives, and the representations at different timestamps from the same time series as negatives. This encourages the model to learn that the same moment in time should have consistent representations regardless of the augmentation context.

![[Pasted image 20250808133142.png]]

Looking back at this example, the region between 35 and 50 is where the positive pairs are found, while the area outside the intersection contains the negative pairs.

#### Instance-wise Contrastive Loss

The instance-wise contrastive loss works by taking representations at the **same timestamp** from two augmented views of the **same time series** as **positive pairs**. It then uses representations at the same timestamp from **other time series** in the batch as **negative pairs**. This helps the model learn instance-specific characteristics that differentiate one time series from another.

Both of these losses are calculated at every resolution level.

### How do I know my embeddings are good? / Evaluating TS2vec

Traditionally in machine learning, we partition the data into training and testing (maybe also validation). We do this to verify whether the model is generalizing correctly, instead of "memorizing the answers" in the data used for fitting the model. However, this type of testing only works in situations where you have labels for your data. So how can we measure performance?

To evaluate unsupervised representations (in an academic setting), we typically use labeled benchmark datasets. First we train the representation model on the train partition of the data. Then we fit a predictive model on the train partition, using the representations as inputs to predict the labels. Finally, we evaluate this model on the test split to assess representation quality.

To benchmark TS2vec and TimeDRL, the authors use forecasting as the task to evaluate timestamp representations and classification for instance representations. To evaluate with forecasting, you train a model to predict the next timestamp value using your last timestamp embedding as an input. To evaluate instance embeddings with classification, you train a model that takes the instance-level representation and predicts a class. Note that anomaly detection is a classification task too.

%% Because we only care about representations to the degree that they are useful for downstream applications. One of the ways we can measure performance is by using a small set of labeled data and making predictions using the representations. %%

### Benchmarking Datasets

To showcase the performance of TS2vec we will focus on the model's results across established benchmark datasets used in the literature. It's worth noting that the original paper also evaluates on the **125 UCR datasets** and **29 UEA datasets**; however, since these datasets are not as commonly used, we will not be discussing them.
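Before looking at the datasets, here is a minimal sketch of the linear-probe style evaluation described above for the classification setting. It assumes scikit-learn; the random arrays are stand-ins for instance-level embeddings produced by a frozen, pre-trained encoder and for the dataset labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data: in practice these embeddings would come from the pre-trained encoder
# applied to the train and test partitions of a labeled benchmark dataset.
rng = np.random.default_rng(0)
train_embeddings, train_labels = rng.normal(size=(500, 100)), rng.integers(0, 2, 500)
test_embeddings, test_labels = rng.normal(size=(100, 100)), rng.integers(0, 2, 100)

# Fit a simple predictive model on the train partition using the representations as inputs...
probe = LogisticRegression(max_iter=1000).fit(train_embeddings, train_labels)

# ...and evaluate it on the test split to assess representation quality.
print("probe accuracy:", accuracy_score(test_labels, probe.predict(test_embeddings)))
```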
#### Forecasting datasets

| Datasets      | Features | Timesteps | Frequency |
| ------------- | -------- | --------- | --------- |
| ETTh1 & ETTh2 | 7        | 17,420    | 1 hour    |
| ETTm1 & ETTm2 | 7        | 69,680    | 15 min    |
| Exchange      | 8        | 7,588     | 1 day     |
| Weather       | 21       | 52,696    | 10 min    |

The **ETT (Electricity Transformer Temperature)** datasets originate from real-world electric power infrastructure monitoring in China, where electricity transformer temperature and power load data were collected from different provinces over a two-year period to capture the operational patterns of electrical systems. There are 4 variants of this dataset: ETTh1 and ETTh2 were sampled hourly, while ETTm1 and ETTm2 were sampled every 15 minutes.

The **Exchange** dataset was compiled from international financial markets, tracking the daily fluctuations of foreign currency exchange rates across eight major economies including Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore over a 26-year span from 1990 to 2016. The sampling rate is 1 day.

**Weather** data comes from the National Centers for Environmental Information's comprehensive climatological monitoring network, which systematically records meteorological conditions across nearly 1,600 locations throughout the United States over a four-year observation period. The dataset includes features such as maximum, minimum, and average temperature, temperature departure from normal, dew point temperature, average station pressure, ceiling, visibility, weather type, wet bulb temperature, relative humidity, degree days (heating and cooling), daily precipitation, average wind speed, fastest wind speed/direction, sky cover, and occurrences of sunshine, snowfall and snow depth.

#### Classification Datasets

| Datasets        | Samples | Features | Classes | Length |
| --------------- | ------- | -------- | ------- | ------ |
| FingerMovements | 416     | 28       | 2       | 50     |
| PenDigits       | 10,992  | 2        | 10      | 8      |
| HAR             | 10,299  | 9        | 6       | 128    |
| Epilepsy        | 11,500  | 1        | 2       | 178    |
| WISDM           | 4,091   | 3        | 6       | 256    |

**FingerMovements** comes from ergonomic and human-computer interaction research, where a single subject's typing behavior was monitored during three separate six-minute keyboard typing sessions. The features consist of brain activity readings recorded while typing. The classification target is whether the user will use the left or right hand for the next keystroke.

**PenDigits** was developed through a handwriting recognition experiment where 44 individuals wrote digits 0 through 9 on a digitizing tablet that captured the x and y coordinates of the pen. The objective in this dataset is to classify the digit that is being written.

**HAR** represents a controlled human activity recognition study where researchers equipped 30 participants with Samsung Galaxy S2 smartphones to capture accelerometer and gyroscope measurements while they performed distinct physical activities. The possible classes are walking, walking upstairs, walking downstairs, sitting, standing, and lying down.

The **Epilepsy** dataset emerges from medical research facilities where neurologists recorded single-channel EEG brain activity from 500 patients at 174 Hz for 23.6 seconds each, creating a valuable resource for automated epilepsy detection systems. The objective is to classify each recording as either showing epileptic activity or not.
**WISDM** expands on similar human activity monitoring by collecting sensor data from both smartphones and smartwatches worn by 51 test subjects, who each performed 18 different activities for precisely three minutes per activity, creating a comprehensive dataset of human movement patterns. The objective is to classify which activity is being performed.

### Results for forecasting

#### A (very) brief history of timestamp representations

We will be comparing TS2vec to the previous state of the art models. **TCN (2016)** was the first use of convolutional layers for time series representation; it relied on causal convolutions, which avoid data leakage from future data points. **Informer (2020)** was the first state of the art model to leverage transformers for time series representations. **TNC (2021)** utilizes temporal neighborhoods (segments of a time series that are close in time and share similar statistical properties) to create contrastive pairs. **CoST (2022)**'s main innovation was including a learnable Fourier layer that enabled high quality representation of seasonality. It's worth noting that these models were only the state of the art for timestamp representations.

#### Results

The following results were obtained using two linear layers and a sequence length of 24. More details are available in the TimeDRL paper.

<div class="ml-model-comparison" data-exclude-models="TimeDRL,SimTS" data-title="TS2Vec Forecasting Performance"></div>

Overall we can see that in every benchmark TS2vec is either the best or second best model, with CoST outperforming it on datasets with heavy seasonality.

### Results for classification

#### A (very) brief history of instance representations

We will be comparing TS2vec to the previous state of the art models. **SimCLR (2020)** was the first breakthrough in contrastive learning for visual representations, introducing the NT-Xent loss and demonstrating that data augmentations and projection heads were critical for effective contrastive learning. **BYOL (2020)** revolutionized self-supervised learning by eliminating the need for negative pairs entirely, using a teacher-student architecture with exponential moving averages to bootstrap representations. **CCL (2020)** tackled the false negative problem in contrastive learning by incorporating clustering information to filter out semantically similar negatives and supplement positives. **TSTCC (2021)** was the first major adaptation of contrastive learning specifically for time series, introducing temporal and contextual contrasting modules to capture both temporal dependencies and discriminative features. **MHCCL (2023)** advanced time series contrastive learning by using hierarchical clustering to create masked cluster-wise contrastive pairs, addressing the false negative problem at multiple granularity levels. Again, it's worth mentioning that these models were primarily focused on instance-level representations, while TS2Vec achieved state-of-the-art performance on both timestamp and instance level representations!

<div class="ml-model-comparison" data-type="classification" data-exclude-models="TimeDRL" data-title="TS2Vec Classification Performance"></div>

As before, TS2vec stays near the top on every benchmark, showcasing its versatility across domains.

### Impact

When it came out, TS2vec represented a paradigm shift, being the first model to achieve SOTA in both instance and timestamp level representations.
After this, for a model to be on the bleeding edge it would have to provide high quality representations at both the instance and timestamp level. TS2vec also showed that reducing the biases from pair selection improves the model's expressivity. These ideas were later expanded on by TimeDRL.

## TimeDRL

TimeDRL

<div class="ml-model-comparison" data-title="TimeDRL Forecasting Performance"></div>

### Results

<div class="ml-model-comparison" data-type="classification" data-title="TimeDRL Classification Performance"></div>