Assignment 1: Self/Weakly-Supervised Learning on Tabular Data

[Advanced Machine Learning Course]


Research Background:

Out-of-Distribution (OOD) Generalization for Tabular Data

Tabular data are widely used in real-world applications such as financial risk control, medical diagnosis, industrial inspection, and online advertising. They represent one of the most fundamental data formats in industry. However, unlike image or text tasks, tabular data typically exhibit high feature heterogeneity, a mixture of categorical and continuous variables, a lack of explicit structural relationships among features, limited data scale, and strong scenario-dependent distributions. These characteristics pose significant challenges to the generalization ability of deep learning models.

In traditional supervised learning settings, models are optimized under the assumption that the training and test data are independent and identically distributed (i.i.d.). In real-world scenarios, however, this assumption is often violated. For example:

These phenomena are collectively referred to as the Out-of-Distribution (OOD) Generalization problem. Under OOD settings, a model may perform well on the training distribution, but its performance can deteriorate significantly when the test distribution changes, sometimes leading to catastrophic failure. This issue is particularly critical in high-risk domains such as finance and healthcare.

In recent years, substantial progress has been made in OOD learning within computer vision, including domain generalization, test-time adaptation, and causal representation learning. However, many of these approaches rely on structural inductive biases inherent in image data, such as spatial invariance and convolutional architectures. For tabular data, due to the lack of explicit structural information and the semantic independence and non-exchangeability of feature dimensions, existing OOD techniques cannot be directly transferred.

Moreover, tabular data face several unique challenges in OOD scenarios:

  1. Non-shareable feature semantics: Features across datasets often differ significantly in meaning, making it difficult to build unified pre-trained models;
  2. Complex distribution shifts: Covariate shift, label shift, and concept drift may occur simultaneously;
  3. Small-data settings: Datasets are often too small for large-scale pretraining to be a feasible way of mitigating distribution shifts;
  4. Hidden causal relationships: Models based purely on statistical correlations tend to fail under distribution changes.
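
For reference, the three shift types named in item 2 are commonly formalized as below. This is a sketch in notation assumed here, not taken from the assignment: p_S and p_T denote the source (training) and target (test) distributions over features x and label y.

```latex
\begin{align*}
\text{Covariate shift:} \quad & p_S(x) \neq p_T(x), \qquad p_S(y \mid x) = p_T(y \mid x) \\
\text{Label shift:}     \quad & p_S(y) \neq p_T(y), \qquad p_S(x \mid y) = p_T(x \mid y) \\
\text{Concept drift:}   \quad & p_S(y \mid x) \neq p_T(y \mid x)
\end{align*}
```

In practice these conditions rarely hold in isolation, which is why the list above notes that the shifts may occur simultaneously.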

Therefore, developing robust OOD generalization methods for tabular data has become an important research direction in machine learning. Although existing studies have made some progress, effectively modeling distribution shifts, constructing stable invariant features, and achieving reliable generalization without target-domain labels remain open and challenging problems.

Requirement:

Project Requirements

This project aims to design a self-supervised or weakly-supervised tabular machine learning method based on the TableShift benchmark, in order to improve model generalization under Out-of-Distribution (OOD) settings. The focus of this project is not merely to compare existing approaches, but to propose and implement a novel self-supervised or weakly-supervised mechanism (e.g., Test-Time Adaptation, TTA) that mitigates performance degradation caused by distribution shifts, without relying on target-domain labels.

The core research objectives include:


I. Dataset Requirements

This project will be conducted on the TableShift benchmark. TableShift is a standardized benchmark designed for evaluating OOD generalization in tabular data. It includes multiple real-world datasets and provides clearly defined In-Distribution (ID) and Out-of-Distribution (OOD) splits.

All experiments must strictly follow the official train/validation/test splits. Under the standard OOD setting:

The required datasets are:

For details, refer to the official TableShift GitHub: https://github.com/mlfoundations/tableshift

For convenience, we provide a data download link: https://box.nju.edu.cn/d/958c50ca9223485eadac/ (the password will be provided via the QQ group).


II. Method Design Requirements

The proposed method must be built around self-supervised or weakly-supervised mechanisms. Students may choose one of the following three categories or combine multiple directions. All methods must clearly specify the data accessible during the training stage.

Category 1: Domain Generalization (DG) Methods

Core idea: Learn invariant representations or robust objectives during training so that the model generalizes to unseen OOD domains.

Data constraints:

Example directions:

Self-supervised extensions may be incorporated during training, such as feature reconstruction, mask prediction, or consistency constraints.
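
As an illustration of the mask-prediction idea, the sketch below builds a corrupted view of a table together with the mask a pretext model could be trained to predict. It is a minimal example under stated assumptions, not a required design: the function name mask_and_corrupt, the mask_prob parameter, and the choice to refill masked cells from the same column of a random row are all hypothetical. Refilling from the same column keeps every replacement inside that feature's own domain, which matters because tabular features are not interchangeable.

```python
import random


def mask_and_corrupt(rows, mask_prob=0.3, seed=0):
    """Return (corrupted_rows, mask) for a list of equal-length rows.

    mask[i][j] is 1 where cell (i, j) was selected for corruption and 0
    otherwise. A selected cell is replaced by the value found in column j
    of a randomly chosen row, so the replacement may occasionally equal
    the original value.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    n_cols = len(rows[0])
    corrupted, mask = [], []
    for row in rows:
        new_row, row_mask = [], []
        for j in range(n_cols):
            if rng.random() < mask_prob:
                donor = rows[rng.randrange(n_rows)]
                new_row.append(donor[j])  # stay within column j
                row_mask.append(1)
            else:
                new_row.append(row[j])
                row_mask.append(0)
        corrupted.append(new_row)
        mask.append(row_mask)
    return corrupted, mask


# Toy table: one continuous, one categorical, one continuous column.
table = [[25, "a", 1.5], [40, "b", 2.5], [31, "a", 0.5], [58, "c", 3.0]]
corrupted_table, cell_mask = mask_and_corrupt(table, mask_prob=0.5, seed=1)
```

A pretext model would then take corrupted_table as input and be trained to predict cell_mask, to reconstruct table, or both; only unlabeled training features are needed for this step.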

Category 2: Test-Time Adaptation (TTA) Methods

Core idea: Adapt the model at test time using unlabeled target-domain data to mitigate distribution shifts.

Data constraints:

You must clearly specify:

Example directions:

Special attention should be given to the non-exchangeability of tabular features when designing TTA strategies.
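
One very simple form of label-free adaptation is to re-estimate input normalization statistics on an unlabeled target batch while leaving the model weights untouched. The sketch below illustrates that idea only; it is not a prescribed method, and the helper names fit_scaler, adapt_scaler, and apply_scaler as well as the momentum parameter are assumptions introduced here. Statistics are kept per column and only for columns declared continuous, so categorical features are never rescaled or mixed with other features.

```python
from statistics import fmean, pstdev


def fit_scaler(rows, continuous_cols):
    """Compute (mean, std) for each continuous column index."""
    stats = {}
    for j in continuous_cols:
        col = [row[j] for row in rows]
        std = pstdev(col)
        stats[j] = (fmean(col), std if std > 0 else 1.0)
    return stats


def adapt_scaler(source_stats, target_rows, momentum=0.5):
    """Blend source statistics with those of unlabeled target rows.

    momentum=0 keeps the source statistics; momentum=1 uses only the
    target batch. No target labels are read.
    """
    target_stats = fit_scaler(target_rows, list(source_stats))
    blended = {}
    for j, (s_mean, s_std) in source_stats.items():
        t_mean, t_std = target_stats[j]
        blended[j] = (
            (1 - momentum) * s_mean + momentum * t_mean,
            (1 - momentum) * s_std + momentum * t_std,
        )
    return blended


def apply_scaler(rows, stats):
    """Standardize the continuous columns; other columns pass through."""
    out = []
    for row in rows:
        new_row = list(row)
        for j, (mean, std) in stats.items():
            new_row[j] = (row[j] - mean) / std
        out.append(new_row)
    return out


# Column 0 is continuous, column 1 is categorical and left as-is.
source = [[0.0, "a"], [2.0, "b"]]
target = [[10.0, "a"], [14.0, "b"]]
source_stats = fit_scaler(source, continuous_cols=[0])
adapted_stats = adapt_scaler(source_stats, target, momentum=1.0)
target_scaled = apply_scaler(target, adapted_stats)
```

With momentum=1.0 the target batch is standardized by its own mean and standard deviation, so the shifted column is mapped back to the scale the model saw during training. Whether adaptation happens per batch or over the whole test set, and which parameters are updated, are exactly the details that must be stated explicitly for any TTA method.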

Category 3: Other Self-Supervised / Weakly-Supervised Methods

Alternative frameworks beyond DG and TTA may also be proposed, such as:

(1) Model selection or ensemble methods

(2) Self-supervised pretraining + downstream fine-tuning

(3) Causal or structural modeling methods

All data usage constraints must be explicitly clarified.

Innovation Requirement


III. Experimental Report

(1) Method Design

(2) Experimental Setup

(3) Experimental Results

Report OOD performance of your method and at least three baselines:

Additionally report:


IV. Experimental Protocol


V. Report Format

The report must follow the official NeurIPS 2025 template.

Template download link: https://media.neurips.cc/Conferences/NeurIPS2025/Styles.zip

The report should follow standard academic paper structure, including abstract, introduction, related work, methodology, experiments, and conclusion.

You may find some useful references from the following papers:

How to submit?

Submit your report in PDF format together with your code (including a README file) in a single zip file. Name the zip file 'Student ID + name'. Reports are collected via NJU Box at https://box.nju.edu.cn/u/d/f62af0d399cd4077b1e4/ (the password will be provided via the QQ group).

The evaluation of your report.

Your language: concise, precise, and logical.

Your organization: clearly and properly separated sections and paragraphs.

Your format: careful handling of citations and overall consistency.

Insights: clear, principled, and technically grounded.

About the DEADLINE and Score.

The original deadline was 2026.05.08 23:59:59. Update of 2026.04.24: the deadline has been extended to 2026.05.15 23:59:59 (final).

Additional Issues

If you have any questions, please contact us via email: yuky@lamda.nju.edu.cn.