#work_in_progress #blog_post

A couple of weeks ago a friend of mine working at Grafana showed me a video of a hackathon project. The project was called something along the lines of "agentic research" or "deep researcher": they used an LLM to run research over user actions. While I thought that was very cool™, I couldn't stop thinking about the limitations of using an LLM to analyze a set of actions, because the model is incapable of learning userbase-level statistics. That is to say, the model cannot know what "regular behavior" looks like, because it only sees one set of actions at a time, with no memory of other users. This is not a big limitation for a hackathon project. However, developing tools that can capture the complexities of user actions is a critical issue for many applications, especially when identifying behavioral patterns for regression, recommendations, or anomaly detection. To address this, deep learning architectures have been used to create high-quality representations of user behavior by borrowing techniques from language modeling. In this article I want to cover some of these techniques, partly because I think they are cool™, but also because I believe this is the future for monitoring any task that can be represented as a sequence. In particular I will focus on unsupervised representation learning, since it is the most broadly applicable: you don't have to label considerable amounts of data.

# What is a foundational model?

The name comes from the paper [On the Opportunities and Risks of Foundation Models](https://arxiv.org/abs/2108.07258) from Stanford University. A foundation model is any model trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. Some examples are BERT, GPT-2, CLIP, or more recently the open-source Chinese models like Qwen. When this paper came out everybody hated it ([proof](https://www.youtube.com/watch?v=tunf2OunOKg&t=715s)), mainly because it was citation farming, but also because the term "foundational model" was too broad and lacked specifics. As time went on, though, we eventually settled on using the term for any model that can serve as a starting point to fine-tune for a specific application.

# What is a representation?

A representation (or embedding) in machine learning is a way of encoding data into a numerical format that captures the essential features and relationships within that data. An encoder model transforms raw data (audio, text, images, video) into a list of numbers (generally called a vector) representing that data. Think of it as translating information into a mathematical language that algorithms can understand and work with effectively. One of my favorite examples of an application for encoders is facial recognition. Facial recognition works by training a model (don't worry about how) to receive images of faces as inputs and output a list of numbers. During training, the model learns that photos of the same person should get similar numbers, while photos of different people should get very different numbers. So if you have two different photos of someone, both will map to very similar lists of numbers, because their face has the same basic features in both pictures.
Facial recognition then works by taking the encoded picture of an identified person and comparing it to the embedding of a picture of an unidentified person. If the two vectors are similar enough, we determine that both pictures are of the same person.

![[Pasted image 20250731094124.png]]

The key takeaway here is that embeddings are very useful for grouping and segmenting unidentified (unlabeled) data. For a more detailed explanation, I recommend the ["Embeddings are underrated"](https://technicalwriting.dev/ml/embeddings/overview.html#underrated) blog post. Now that we know what an encoder is and what foundational models are, we can define a foundational sequence encoder model.

# Foundational Sequence Encoder for Behavioral Monitoring™

A foundational sequence encoder model is a neural network pre-trained on large-scale sequential data that learns to create rich representations of sequences, which can then be fine-tuned or adapted for multiple downstream tasks. A sequence can be many things: a time series (or set of time series), a list of actions, or even a video. The embeddings can be at the instance level, where you create one representation for the entire sequence, or at the timestamp level, where you create a representation for every timestamp in the series.

![[encoders.drawio.png]]

Timestamp-level representations tend to be better for regression and anomaly detection, while instance-level representations are useful for classification and clustering. For those more familiar with language models, instance-level encoders are the equivalent of BERT models, where the output for the entire text input is a single embedding. Timestamp-level encoders are akin to SpanBERT models, where the output is an embedding for every input token.

# Encoding time series

## Learning to embed time series

Let's start by identifying what type of training we would need in order to learn representations for time series. Looking at the [Unsupervised Representation Learning for Time Series: A Review](https://arxiv.org/abs/2312.04142) literature review, a pattern emerges: most recent papers (up until its publication in August 2023) focus on contrastive learning. This remains the dominant approach for time series representations, with [TimeDRL](https://arxiv.org/abs/2312.04142) (a contrastive learning approach) being the state of the art. For the purposes of this article I will focus on [TS2vec](https://arxiv.org/abs/2106.10466) and [TimeDRL](https://arxiv.org/abs/2312.04142), as I find them very instructive for explaining the modern approach to representation learning on time series. (I will also likely write an in-depth article going over some of their behaviors.) Before moving on, I would like to point out another cool trend: the increase in papers treating videos as time series. Very cool!

## Short intro to contrastive learning

Contrastive learning is a machine learning paradigm in which unlabeled data points are juxtaposed against each other to teach a model which points are similar and which are different. Going back to our facial recognition example: most facial recognition models are trained using contrastive learning.
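Before we get into training, here is what the verification step from earlier might look like in code. This is a minimal sketch: the embeddings are random stand-ins (a real system would get them from a trained face encoder), and the 0.8 threshold is an arbitrary value I picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in embeddings: in a real system these come from a trained face encoder
known_face = rng.normal(size=128)                          # identified person
candidate = known_face + rng.normal(scale=0.1, size=128)   # new photo, same person?

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# accept the match if the similarity clears a tuned threshold (0.8 is arbitrary here)
same = cosine_similarity(known_face, candidate) > 0.8
print("same person" if same else "different person")
```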
These models are generally trained with the objective of producing similar representations for pictures of the same person (referred to as a positive pair) and very different representations for pictures of different people (referred to as a negative pair). To push this further, there is a focus on creating "hard training batches" by selecting similar-looking people, or by taking pictures of the same person from different angles or under different lighting. This considerably improves performance and is the basis for papers like [Contextual Document Embeddings](https://arxiv.org/abs/2410.02525). For more details on how facial recognition works, I recommend this blog post: [Triplet Loss: Intro, Implementation, Use Cases](https://www.v7labs.com/blog/triplet-loss).

## TS2vec

TS2vec was the state-of-the-art model until March 2023, when [SimTS](https://arxiv.org/abs/2303.18205) came out; at the time, it represented a considerable breakthrough, surpassing all comparable models. The model is primarily timestamp-based, but the authors provide a methodology to derive instance-level embeddings through max pooling. The model's remarkable performance can be attributed to two main factors: its training and its pair construction.

### Pair construction for time series representation

To learn time series representations, we generally create positive and negative pairs to contrast. For negative pairs, you generally just select two different time series. You might also implement some **hard negative mining**, where you select time series that are similar but differ in some key way, in order to make the model learn to differentiate similar inputs. Getting positive pairs is a little more complicated. Instead of searching for time series that happen to be very similar, most contrastive learning approaches compare a time series to an augmented version of itself. This approach, however, can introduce biases into the model. The following are some of the most common pair construction methods; they will add context to the augmentations used by TS2vec.

#### Subseries sampling

Subseries sampling works by selecting a time series and constructing a positive pair from a subsection of it. In the following picture, green is the original time series and yellow is the subseries used as its pair.

![[Pasted image 20250806175421.png]]

The bias learned is known as **subseries consistency** [(Franceschi, Dieuleveut, and Jaggi, 2019)](https://arxiv.org/abs/1901.10738), which encourages the representation of a time series to be close to those of its sampled subseries. This, however, can lead the model to ignore important information from the surrounding context in its representation.

![[Pasted image 20250806214722.png]]

For the pair in the example above, the subseries lacks the context of the rest of the sequence. In order for the yellow area to be encoded the same as the green one, the model must ignore the surrounding context.

#### Adjacent subseries sampling

Adjacent subseries sampling works by selecting two adjacent but disjoint portions of the same time series, at two different points in time.

![[Pasted image 20250806215853.png]]

The bias learned from this is called **temporal consistency** [(Tonekaboni, Eytan, and Goldenberg, 2021)](https://arxiv.org/abs/2106.00750), which enforces the local smoothness of representations by choosing adjacent segments as positive samples. The main drawback with this type of sampling is that you might sample two subseries with wildly different behaviors.

![[Pasted image 20250806215710.png]]

For the example above, it's unreasonable to expect both series to have similar encodings, so training towards this objective can result in unexpressive embeddings.
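Here is a minimal sketch of what these two positive-pair strategies might look like in code. The function names and the toy series are my own illustrative choices, not any paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def subseries_pair(x: np.ndarray, crop_len: int):
    """Positive pair: the full series and a randomly placed subseries of it."""
    start = rng.integers(0, len(x) - crop_len + 1)
    return x, x[start:start + crop_len]

def adjacent_pair(x: np.ndarray, crop_len: int):
    """Positive pair: two adjacent, non-overlapping windows of the same series."""
    start = rng.integers(0, len(x) - 2 * crop_len + 1)
    return x[start:start + crop_len], x[start + crop_len:start + 2 * crop_len]

# a toy seasonal series with noise
x = np.sin(np.linspace(0, 20, 500)) + rng.normal(scale=0.1, size=500)
a, b = adjacent_pair(x, crop_len=100)  # the model is pushed to embed a and b similarly
```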
#### Transformation of the original series

Another alternative is to create a pair using a transformation of the original series. Below we see the original time series (green) and its moving average (yellow).

![[Pasted image 20250807120106.png]]

The idea here is to teach the model to create the same representation for the original time series and the transformed one, making the model transformation-invariant. The main risk with using transformations is that the model might confuse different time series that happen to share a particular property. This may or may not be a good thing. For example, suppose you want to train an instance representation model to cluster time series with similar seasonal patterns regardless of their magnitude. To achieve this you could create positive pairs comparing an original time series with the same series scaled up or down. While this might work in specific cases, the general mindset when implementing deep learning systems is that you shouldn't bake assumptions into the data; you should let the model learn them from the data instead.

### TS2vec augmentations

In order to avoid the biases mentioned previously, TS2vec utilizes two novel augmentations: **random overlapping cuts** and **timestamp masking**.

#### Random overlapping cuts

Random overlapping cuts work by sampling two subseries from a time series such that the two overlap. Since TS2vec produces timestamp embeddings, the idea is that the timestamp embeddings for the intersection should be similar despite the two subseries having different contexts.

![[Pasted image 20250808133142.png]]

#### Timestamp masking

Timestamp masking is applied in conjunction with the random overlapping cuts. It works by randomly setting to zero half of the timestamps after the first layer of the neural net. The idea is that the embeddings should remain similar even with some information hidden, which also pushes the model to learn to partially reconstruct the removed information.

![[Pasted image 20250808141504.png]]

### Training TS2vec

![[ts2vec_model.png]]

First, let's examine the general architecture of the model for a univariate time series. As seen above, we first create a pair of views using **random overlapping cuts**. Notice that after the first layer we apply **timestamp masking**. We then use dilated convolutions to create the timestamp embeddings. From here the paper introduces its main training contribution: **hierarchical contrasting**. Hierarchical contrasting works by calculating losses at multiple levels of resolution. This is done by repeatedly max pooling the timestamp embeddings with a size-2 kernel.

![[max_pool.png]]

We repeat the process until we can't apply it anymore. For example, if we create embeddings of dimension 100 for 32 timestamps, we will pool 5 times until we end up with an embedding of shape (1, 100). This final embedding becomes our instance-level representation.
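To make the pooling mechanics concrete, here is a small PyTorch sketch of the hierarchy. It only shows the shapes; in the actual model the contrastive losses are computed at every one of these levels.

```python
import torch
import torch.nn.functional as F

# timestamp embeddings for one series: (batch, embed_dim, timestamps)
z = torch.randn(1, 100, 32)

levels = [z]  # in TS2vec, the losses are computed at each of these resolutions
while levels[-1].size(-1) > 1:
    levels.append(F.max_pool1d(levels[-1], kernel_size=2, ceil_mode=True))

print([lvl.size(-1) for lvl in levels])      # [32, 16, 8, 4, 2, 1]
instance_embedding = levels[-1].squeeze(-1)  # (1, 100): the instance-level representation
```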
By calculating the loss at different levels we aim to achieve three things:

- **Capture multi-scale patterns**: different temporal granularities reveal different types of patterns.
- **Learn universal representations**: the same representation can be used for tasks requiring different semantic levels (timestamp-level for forecasting, instance-level for classification).
- **Improve robustness**: learning at multiple scales makes the model more robust to noise and variations.

We will now explain how the loss is calculated; this step is repeated at every resolution level.

#### Temporal contrastive loss

To learn discriminative representations over time, TS2vec takes the representations at the same timestamp from the two views of the input time series as positives, and those at different timestamps within the same time series as negatives. This encourages the model to learn that the same moment in time should have a consistent representation regardless of the augmentation context.

![[Pasted image 20250808133142.png]]

Looking back at this example, the region between 35 and 50 is where the positive pairs are found, while the areas outside the intersection contain the negative pairs.

#### Instance-wise contrastive loss

The instance-wise contrastive loss takes representations at the **same timestamp** from two augmented views of the **same time series** as **positive pairs**, and representations from **other time series** in the batch as **negative pairs**. This helps the model learn instance-specific characteristics that differentiate one time series from another. Both of these losses are calculated at every resolution level.

### How do I know my embeddings are good? / Evaluating TS2vec

Traditionally in machine learning we partition the data into training and testing (and maybe validation) sets. We do this to verify whether the model generalizes correctly, instead of "memorizing the answers" in the data used to fit it. However, this type of testing only works when you have labels for your data. So how can we measure performance?

To evaluate unsupervised representations (in an academic setting), we typically use labeled benchmark datasets. First we train the representation model on the train partition of the data. Then we fit a predictive model on the train partition, using the representations as inputs to predict the labels. Finally, we evaluate this model on the test split to assess representation quality. To benchmark TS2vec and TimeDRL, the authors use forecasting as the task for evaluating timestamp representations and classification for instance representations. To evaluate with forecasting, you train a model to predict the next timestamp's value using your last timestamp embedding as input. To evaluate instance embeddings with classification, you train a model that takes the instance-level representation and predicts a class. Note that anomaly detection is a classification task too. %% We only care about representations to the degree that they are useful for downstream applications. One way to measure performance is to take a small set of labeled data and make predictions using the representations. %%
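To make the protocol concrete, here is a minimal sketch with random data and a placeholder encoder standing in for a frozen representation model. The probe here is a logistic regression of my choosing; the papers use their own probes (e.g. an SVM for classification, ridge regression for forecasting).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# stand-in for a frozen encoder: in practice this would be e.g. TS2vec's
# instance-level embedding (max-pooled timestamp embeddings)
def encode_instances(x: np.ndarray) -> np.ndarray:
    return x.mean(axis=1)  # (n_series, n_features) placeholder embedding

x_train, y_train = rng.normal(size=(128, 100, 8)), rng.integers(0, 2, size=128)
x_test, y_test = rng.normal(size=(32, 100, 8)), rng.integers(0, 2, size=32)

# fit a simple probe on the frozen embeddings;
# its accuracy on the test split scores the representation quality
probe = LogisticRegression(max_iter=1000).fit(encode_instances(x_train), y_train)
print(accuracy_score(y_test, probe.predict(encode_instances(x_test))))
```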
### Benchmarking datasets

To showcase the performance of TS2vec, we will focus on the model's performance across established benchmark datasets used in the literature. It's worth noting that the original paper evaluates on the **125 UCR datasets** and the **29 UEA datasets**. However, since those are not great to visualize and don't showcase the utility of this type of representation, I opted to use the datasets from the TimeDRL paper.

#### Forecasting datasets

| Datasets      | Features | Timesteps | Frequency |
| ------------- | -------- | --------- | --------- |
| ETTh1 & ETTh2 | 7        | 17,420    | 1 hour    |
| ETTm1 & ETTm2 | 7        | 69,680    | 15 min    |
| Exchange      | 8        | 7,588     | 1 day     |
| Weather       | 21       | 52,696    | 10 min    |

The **ETT (Electricity Transformer Temperature)** datasets originate from real-world electric power infrastructure monitoring in China, where transformer temperature and power load data were collected from different provinces over a two-year period to capture the operational patterns of electrical systems. There are four variants: ETTh1 and ETTh2 were sampled hourly, while ETTm1 and ETTm2 were sampled every 15 minutes.

The **Exchange** dataset was compiled from international financial markets, tracking the daily fluctuations of foreign currency exchange rates across eight major economies (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore) over a 26-year span from 1990 to 2016.

The **Weather** data comes from the National Centers for Environmental Information's climatological monitoring network, which systematically records meteorological conditions across nearly 1,600 locations throughout the United States; this dataset covers a four-year observation period. It includes features such as maximum, minimum, and average temperature, temperature departure from normal, dew point temperature, average station pressure, ceiling, visibility, weather type, wet bulb temperature, relative humidity, degree days (heating and cooling), daily precipitation, average wind speed, fastest wind speed/direction, sky cover, and occurrences of sunshine, snowfall, and snow depth.

#### Classification datasets

| Datasets        | Samples | Features | Classes | Length |
| --------------- | ------- | -------- | ------- | ------ |
| FingerMovements | 416     | 28       | 2       | 50     |
| PenDigits       | 10,992  | 2        | 10      | 8      |
| HAR             | 10,299  | 9        | 6       | 128    |
| Epilepsy        | 11,500  | 1        | 2       | 178    |
| WISDM           | 4,091   | 3        | 6       | 256    |

**FingerMovements** comes from brain-computer interaction research, where a single subject's typing behavior was monitored during three separate six-minute keyboard typing sessions. The features consist of brain activity readings recorded while typing. The classification target is whether the user will use the left or right hand for the next keystroke.

**PenDigits** was developed through a handwriting recognition experiment where 44 individuals wrote digits 0 through 9 on a digitizing tablet that captured the x and y coordinates of the pen. The objective is to classify the digit being written.

**HAR** is a controlled human activity recognition study where researchers equipped 30 participants with Samsung Galaxy S2 smartphones to capture accelerometer and gyroscope measurements while they performed distinct physical activities. The possible classes are walking, walking upstairs, walking downstairs, sitting, standing, and laying.

The **Epilepsy** dataset comes from medical research facilities where neurologists recorded single-channel EEG brain activity from 500 patients at 174 Hz for 23.6 seconds each, creating a valuable resource for automated epilepsy detection systems. The objective is to classify whether a recording shows epileptic activity or not.
**WISDM** expands on similar human activity monitoring by collecting sensor data from both smartphones and smartwatches worn by 51 test subjects, each performing 18 different activities for precisely three minutes per activity, creating a comprehensive dataset of human movement patterns. The objective is to classify which activity is being performed.

### Results for forecasting

#### A (very) brief history of timestamp representations

We will be comparing TS2vec to the previous state-of-the-art models. **TCN (2016)** was the first use of convolutional layers for time series representation, achieved with causal convolutions that avoid leaking data from future points. **Informer (2020)** was the first state-of-the-art model to leverage transformers for time series representations. **TNC (2021)** utilizes temporal neighborhoods (windows of the same series that are close in time and share stationary behavior) to create contrastive pairs. **CoST (2022)**'s main innovation was a learnable Fourier layer that enabled high-quality representation of seasonality. It's worth noting that these models were only state of the art for timestamp representations.

#### Results

The following results were obtained using two linear layers and a sequence length of 24. More details are available in the TimeDRL paper.

<div class="ml-model-comparison" data-exclude-models="TimeDRL,SimTS" data-title="TS2Vec Forecasting Performance"></div>

Overall, TS2vec is either the best or second-best model on every benchmark; CoST only outperforms it on datasets with heavy seasonality.

### Results for classification

#### A (very) brief history of instance representations

We will be comparing TS2vec to the previous state-of-the-art models. **SimCLR (2020)** was the first breakthrough in contrastive learning for visual representations, introducing the NT-Xent loss and demonstrating that data augmentations and projection heads were critical for effective contrastive learning. **BYOL (2020)** revolutionized self-supervised learning by eliminating the need for negative pairs entirely, using a teacher-student architecture with exponential moving averages to bootstrap representations. **CCL (2020)** tackled the false negative problem in contrastive learning by incorporating clustering information to filter out semantically similar negatives and supplement positives. **TSTCC (2021)** was the first major adaptation of contrastive learning specifically for time series, introducing temporal and contextual contrasting modules to capture both temporal dependencies and discriminative features. **MHCCL (2023)** advanced time series contrastive learning by using hierarchical clustering to create masked cluster-wise contrastive pairs, addressing the false negative problem at multiple granularity levels. Again, it's worth mentioning that these models were primarily focused on instance-level representations. TS2vec, in contrast, achieved state-of-the-art performance on both timestamp- and instance-level representations!

<div class="ml-model-comparison" data-type="classification" data-exclude-models="TimeDRL" data-title="TS2Vec Classification Performance"></div>

As before, TS2vec stays near the top on every benchmark, showcasing its versatility across domains.

### Impact

When it came out, TS2vec represented a paradigm shift, being the first model to achieve SOTA in both instance- and timestamp-level representations.
After this, for a model to be on the bleeding edge it would have to provide high-quality representations at both levels. TS2vec also showed that reducing the biases introduced by pair selection improves the model's expressivity. These ideas were later expanded on by TimeDRL.

## TimeDRL

TimeDRL is (arguably) the current state of the art in time series representation (and definitely the best model trained with contrastive learning). Just like TS2vec, this model creates embeddings at both the timestamp and instance level. TimeDRL was the first SOTA model to use the transformer architecture. It also achieves this while using no augmentations in its training! Which is absolutely crazy.

### Architecture overview

Broadly speaking, TimeDRL is not fundamentally different from a standard encoder transformer, but I want to cover the design decisions that adapt it to time series.

![[Pasted image 20251102210006.png]]

### Patching time series

Traditionally, transformers are not used as much for time series tasks as they are in NLP, despite both being sequence modeling tasks. One of the major factors that has historically hindered the use of transformers for time series modeling is the **computational complexity** of self-attention, which scales quadratically with sequence length ($O(n^2)$). Language models mitigate this cost at inference time with KV caching, which avoids recomputing the key and value vectors, but the same trick cannot be applied to time series without a considerable drop in performance. To address this, the authors use time series patching. Time series patching works by grouping adjacent points of the time series together along the feature dimension instead of processing every timestamp individually. To do this we reshape a time series of shape $(T, C)$ into $(T_p, C \times P)$ where:

- $T =$ original number of time steps
- $C =$ number of channels/features
- $T_p =$ number of patches ($T/P$)
- $P =$ patch length

For example, let's say we want to process a univariate time series with 8 steps. Instead of processing all 8 steps individually, we can stack the steps in groups of 2 along the feature dimension, resulting in a feature matrix of shape (4, 2) instead of (8, 1).

![[Pasted image 20250919195850.png]]

Below we show an example for a multivariate time series.

![[Pasted image 20250919200310.png]]

Time series patching for transformers was first introduced in the paper ["A Time Series is Worth 64 Words: Long-term Forecasting with Transformers"](https://arxiv.org/abs/2211.14730). The idea is to shorten the sequence length while preserving all of the information, reducing the computational cost while keeping most of the performance. With patching, the complexity becomes $O((n/P)^2 \times d)$, where $n$ is the sequence length, $P$ is the patch size, and $d$ is the embedding dimension. Additionally, the time series are normalized by subtracting the mean and dividing by the standard deviation across the time dimension. For multivariate time series, each feature is normalized independently using only its own statistics.
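Here is a minimal numpy sketch of the reshape plus the per-feature normalization. The function name and the tail-dropping behavior are my own choices for illustration.

```python
import numpy as np

def patch_series(x: np.ndarray, patch_len: int) -> np.ndarray:
    """Reshape a (T, C) series into (T_p, C * patch_len) patches."""
    T, C = x.shape
    T_p = T // patch_len
    x = x[: T_p * patch_len]  # drop the tail if T isn't divisible by patch_len
    return x.reshape(T_p, patch_len, C).transpose(0, 2, 1).reshape(T_p, C * patch_len)

x = np.random.randn(8, 1)                          # univariate series, 8 steps
x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)  # each feature uses its own statistics
print(patch_series(x, patch_len=2).shape)          # (4, 2) instead of (8, 1)
```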
### Time series transformers

I will not dive into the actual transformer architecture, as it is outside the scope of this article. However, it is worth mentioning some of the decisions taken by the authors, as they are quite interesting.

#### [CLS] token

The [CLS] token is a special input token prepended to the beginning of every input sequence. Its primary purpose is to aggregate information from the entire sequence for classification tasks. Google introduced the [CLS] token in the original BERT paper; it replaces the need for expensive pooling strategies and became the standard way to obtain a sentence embedding from a transformer encoder. Since then, [CLS] tokens have been adopted in vision transformers and in multimodal models like CLIP. Interestingly, as far as I can tell, this is the first time the token is used in time series modeling. In this paper, the token is prepended to the patched sequence; it is initialized randomly and learned during training.

![[Pasted image 20251101224735.png]]

#### Positional encoding

Most state-of-the-art transformers for language modeling tend to use either relative positional encodings or RoPE embeddings. Interestingly, the authors opted for learnable positional encodings (like the original BERT paper). While the authors do not give an explanation for this, I believe it's because different patches may have different importance for different time series tasks, so making the encoding learnable gives the model that freedom (that being said, this is not expanded on in the ablation studies).

#### Dropout

One piece of advice that you might see if you are looking into training a transformer for language tasks is that you should set the dropout to 0. It's fairly well documented that there is very little overfitting in the current single-epoch paradigm, and this holds for both generative and encoder-only language models (here is a [paper](https://arxiv.org/abs/2505.24788) I found on the topic). In this paper, however, dropout is a key factor in training (more on that later), and it is added after the self-attention block, after the feedforward network, and on the residual connections.

## Training TimeDRL

TimeDRL utilizes two types of losses to learn representations: a contrastive loss for instance-level representations and MSE for timestamp-level representations.

### Contrastive learning for instance-level representations

#### Pair construction

When talking about TS2vec, we discussed the effort to avoid augmentations that introduce biases into the representations, because those biases result in less expressive models. To avoid the biases created by augmentations, the TimeDRL authors decided to not use any type of augmentation at all!

#### Negative pair construction? No!

To understand why, let's talk about another type of bias we would like to avoid: **sampling bias**. Sampling bias occurs when randomly selected negative samples are similar to the positive samples, which is a common scenario in the time series domain due to the presence of periodic patterns.

![[Pasted image 20250827154521.png]]

In the example above we see two different time series that contain nearly identical subseries. To avoid sampling bias, the authors... just don't sample. TimeDRL trains exclusively on positive pairs!

#### Positive pair construction

In order to create positive pairs, the authors opt to run the same input through the model twice and compare the two outputs! Since dropout injects randomness into the representation, even with the same input we get similar but not identical representations.

![[Pasted image 20251102223821.png]]

Because both representations come from the same time series (with some variation due to dropout), they should be very similar.
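You can see this effect with a couple of lines of PyTorch; a toy layer stands in for the full encoder here:

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(64, 64), nn.Dropout(p=0.1))
layer.train()  # keep dropout active

x = torch.randn(1, 64)
z1, z2 = layer(x), layer(x)
print(torch.allclose(z1, z2))  # almost surely False: same input, two different "views"
```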
To apply a contrastive loss using this augmentation, we do the following. First, we encode the same patched time series twice:

$z^1 = f_\theta([CLS] + x_{patched}), \quad z^2 = f_\theta([CLS] + x_{patched})$

Remember that, since the encoder uses dropout, the two representations are slightly different. Then we take the first row of each output, which corresponds to the [CLS] token:

$z^1_i = z^1[0,:], \quad z^2_i = z^2[0,:]$

Then we apply a two-layer MLP with ReLU activations, which we will refer to as $c_\theta$:

$\hat{z}^1_i = c_\theta(z^1_i), \quad \hat{z}^2_i = c_\theta(z^2_i)$

To calculate the contrastive loss we use cosine similarity, since we are trying to make both representations similar to one another. We calculate the loss for the prediction $\hat{z}^1_i$ while ignoring the gradients flowing through $z^2_i$, and vice versa:

$\begin{aligned} & \mathcal{L}_{C1} = -\cos\left(\hat{z}^1, \operatorname{stopgrad}\left(z^2\right)\right) \\ & \mathcal{L}_{C2} = -\cos\left(\hat{z}^2, \operatorname{stopgrad}\left(z^1\right)\right) \end{aligned}$

We do this because, if both sides update toward each other at the same time, the easiest way to minimize this loss is to collapse to a constant representation that ignores the input (model collapse). By only computing gradients for one side at a time, we essentially turn the other representation into a fixed target. This **breaks the symmetry**, preventing the trivial solution. This is also backed up by the ablation studies in the paper, where using the stop-gradient operation leads to a considerable improvement in performance. We then combine both losses to get the final contrastive loss:

$\mathcal{L}_C = \frac{1}{2}\mathcal{L}_{C1} + \frac{1}{2}\mathcal{L}_{C2}$

## Predictive loss for timestamp-level representations

To train the timestamp-level representations, the goal is to make each **timestamp-level embedding** capable of reconstructing its own patch of the time series. This is sometimes referred to as a reconstruction loss. The encoder produces timestamp embeddings $z_t$ that summarize the local temporal structure of each patch. To teach the model to capture useful temporal information, the authors add a **prediction (reconstruction) head** $p_\theta$, which tries to recover the original patched input $x_{patched}$ directly from these timestamp-level embeddings. To do this we take the encoder output for the timestamps, where $T_p$ is the number of patches:

$z_t = f_\theta([CLS] + x_{patched})[1:T_p+1,:]$

These embeddings are passed through a **linear prediction head** $p_\theta$ that reconstructs each patch:

$\hat{x}_{patched} = p_\theta(z_t)$

To measure how well the model reconstructs the original patches, the authors use the mean squared error (MSE) between the original and predicted patches:

$\mathcal{L}_P = \operatorname{MSE}(x_{patched}, \hat{x}_{patched})$

Also, since we are already running the encoder twice, we might as well compute this loss for both passes:

$\mathcal{L}_P = \frac{1}{2}\mathcal{L}_{P1} + \frac{1}{2}\mathcal{L}_{P2}$
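Putting the two losses together, here is a runnable sketch of the training signal. The module names and the toy encoder are mine (the real encoder is a transformer over patched inputs); only the two-pass, stop-gradient, and reconstruction logic follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for TimeDRL's transformer encoder: prepends a learnable
    [CLS] token and applies dropout, so two passes give two distinct views."""
    def __init__(self, feat: int, d: int):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d))
        self.proj = nn.Linear(feat, d)
        self.drop = nn.Dropout(p=0.1)

    def forward(self, x):  # x: (B, T_p, C*P)
        h = self.proj(x)
        cls = self.cls.expand(x.size(0), -1, -1)
        return self.drop(torch.cat([cls, h], dim=1))  # (B, T_p + 1, d)

def timedrl_losses(encoder, proj_head, pred_head, x_patched):
    z1, z2 = encoder(x_patched), encoder(x_patched)  # two dropout "views"

    # instance level: projected [CLS] vs. a stop-gradient copy of the other view
    p1, p2 = proj_head(z1[:, 0]), proj_head(z2[:, 0])
    l_c = -0.5 * (F.cosine_similarity(p1, z2[:, 0].detach(), dim=-1).mean()
                  + F.cosine_similarity(p2, z1[:, 0].detach(), dim=-1).mean())

    # timestamp level: linear head reconstructs each patch, averaged over both views
    l_p = 0.5 * (F.mse_loss(pred_head(z1[:, 1:]), x_patched)
                 + F.mse_loss(pred_head(z2[:, 1:]), x_patched))
    return l_c, l_p

d, feat = 64, 2
encoder = ToyEncoder(feat, d)
proj_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
pred_head = nn.Linear(d, feat)  # the linear reconstruction head p_theta
l_c, l_p = timedrl_losses(encoder, proj_head, pred_head, torch.randn(8, 4, feat))
(l_c + l_p).backward()  # summed here for simplicity; the paper weights the two terms
```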
A good question to ask now is whether this means we are training an autoencoder, and whether the timestamp representations therefore inherit the issues associated with one, like **overfitting to reconstruction**. In practice this does not seem to be the case, likely for two main reasons. First, the decoder is only a linear layer, not a powerful mirror of the encoder, so it can't memorize point-level identity. Secondly, the encoder must learn meaningful structure to minimize the overall loss. Since MSE still drives local accuracy, the timestamp embeddings could lean toward low-level signal features; however, the joint **contrastive loss** counterbalances this, forcing global consistency.

### Results

<div class="ml-model-comparison" data-type="regression" data-title="TimeDRL Forecasting Performance"></div>

<div class="ml-model-comparison" data-type="classification" data-title="TimeDRL Classification Performance"></div>

Overall we can see that TimeDRL is better than the other models at regression and remains comparable on the classification tasks, except FingerMovements, which is kind of very random. One thing worth noting about TimeDRL is that it demands considerably more compute to train. The paper provides a comparison, training with batches of 32 for 10 epochs on an RTX 3070:

![[Pasted image 20251105205758.png]]

This is not inherently a bad thing in my opinion; it shows that there was a lot of room to improve at the task using more compute. I personally believe that this paper perfectly embodies the thesis of [the bitter lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html). The main points of the essay are:

- **Computation wins over human-encoded knowledge**: historically, approaches that rely on general learning methods and massive computation have consistently outperformed systems built with hand-crafted features, domain knowledge, or human insights (labeling fits as human insight, IMO).
- **Human knowledge doesn't scale**: domain-specific techniques that embed human understanding may work well initially, but they create complexity that becomes brittle and doesn't improve much with more resources.
- **Search and learning are fundamental**: the two most important methods are search (exploring possibilities) and learning (from experience); ideally you want both to scale up with more computation.

TimeDRL fits the first argument because it uses unsupervised learning to develop strong representations. It fits the second because it outperforms or matches models that leverage human understanding, like CoST or TS2vec, and avoids augmentations in order to dodge inductive biases. It fits the last point because contrastive learning explores the embedding space by pushing similar samples together and dissimilar ones apart, which is a form of search through representation space.

# Coming soon!

# Frequency-Masked Embedding Inference: A Non-Contrastive Approach for Time Series Representation Learning; a JEPA architecture for time series representation