Today I read a 2025 paper titled "Predicting Short-Term Price Trends of Cryptocurrencies Using Order Book Data," shared by the X account @Kev, which anyone can look up. The core finding: preprocessing of high-frequency data matters more than model complexity. After proper data cleaning, manually designed features fed into simple models perform comparably to, or even better than, fully automated deep models (neural networks that learn features on their own). This is mainstream consensus in traditional finance, but there is little research on it in the cryptocurrency market.

The data is raw L2 order book data from Bybit's public API, dated January 30, 2025. A snapshot is taken every 100ms, with up to 200 levels of buy and sell orders per snapshot. The main experiment used 100,000 snapshots (about 166 minutes), and an extended experiment scaled this to 1,000,000 snapshots (about 28 hours). The data is freely available, so the paper's reproducibility is quite good.

The method splits the data into three groups (unfiltered, SG (Savitzky-Golay) filtering, and Kalman filtering) and feeds each into six models to predict the price direction 100ms / 500ms / 1s ahead, under both binary (up/down) and three-class (up/flat/down) labels. In total: 3 (preprocessing variants) × 6 (models) × 2 (label schemes) × 3 (prediction horizons) = 108 experimental groups.

The models are grouped by complexity as follows:

- Simple models (logistic regression and XGBoost): manually designed features (such as the bid/ask volume difference and supply-demand imbalance) are used as model inputs. They are the fastest, and we can see how the model reaches a judgment from its features: we know not only the results but also the reasons behind them.
- Hybrid models (CNN+CatBoost and CNN+XGBoost): instead of hand-designing features, a neural network learns them from the data, and the learned features are then fed into gradient-boosted trees. The upside is that this may discover feature combinations humans would not think of; the downside is that such features are hard to interpret, so we know the results but not fully the reasons.
- Deep models (DeepLOB and a simplified variant): a fully end-to-end neural network that handles everything from feature extraction (with the difference that it can also extract sequential information as features) to the final judgment. We know the results but not the reasons.

The evaluation metric is the F1 score, which balances precision ("when you predicted an increase, how often did it actually increase") and recall ("of the actual increases, how many did you catch"); it ranges from 0 to 1, higher is better. Training time is also recorded. The split is 80% training, 20% test, with no cross-validation, because time series data is not suitable for random shuffling.

Core Point 1: Data quality matters more than model selection. Take the three-class, 500ms-horizon, 40-level order book setting as an example:

- With the same XGBoost, the F1 score is 0.45 on raw data and rises to 0.54 after SG smoothing, an improvement of about 20%.
- Switching to the more complex DeepLOB, the score on raw data is even lower (0.43). Even with SG smoothing applied (0.52), it still does not beat XGBoost+SG (0.54).

The gain from better data far exceeds the gain from model complexity. Why does SG filtering work so well? Raw order book data is very noisy: prices and order volumes fluctuate wildly at the millisecond level, which is typically considered "flickering" caused by market makers rapidly adjusting their quotes.
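To make the smoothing step concrete, here is a toy sketch (a synthetic mid-price series, not the paper's Bybit feed) using scipy's `savgol_filter` with the window-21 / 3rd-order settings the paper reports as most stable:

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for a mid-price series sampled every 100ms:
# a slow trend plus heavy quote "flicker" noise.
rng = np.random.default_rng(0)
trend = np.linspace(100.0, 101.0, 200)              # true underlying trend
noisy_mid = trend + rng.normal(scale=0.05, size=200)

# Savitzky-Golay smoothing: window of 21 samples, 3rd-order polynomial
# (the parameters the paper found most stable).
smoothed = savgol_filter(noisy_mid, window_length=21, polyorder=3)

# Compare how far each series sits from the true trend.
raw_err = float(np.abs(noisy_mid - trend).mean())
sg_err = float(np.abs(smoothed - trend).mean())
```

On this toy series the smoothed curve sits closer to the underlying trend than the raw one; on real L2 data the same call would be applied per price or volume column before feature extraction.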
SG filtering slides a small window over the data, fits a low-order polynomial within the window at each position, and takes the fitted value at the window's center as the smoothed result. Unlike a simple moving average, it does not smooth away genuine trend turning points, because it fits the local shape of the data with a curve rather than taking a crude average. It can be called with a single line of scipy code; a window of 21 and a third-order polynomial were the most stable parameters in the paper, which can serve as a starting point for your own research.

Core Point 2: The decision window constrains model complexity. Two concepts need to be distinguished here:

- Training time is the offline, one-time cost of fitting the model.
- Inference time is how long the model takes to produce a prediction each time new data arrives in live trading.

Inference frequency depends on the strategy design, and the length of the decision window sets an upper limit on acceptable inference time, which in turn constrains model complexity. ...
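As a rough sanity check of that constraint, one can time a simple model's per-prediction cost against the decision window. The setup below is my own illustrative assumption, not the paper's code: a logistic-regression-style scorer over 80 features (40 order book levels × bid/ask volumes) with arbitrary weights.

```python
import time
import numpy as np

# Hypothetical linear model over 80 features (40 levels x bid/ask volume).
rng = np.random.default_rng(1)
weights = rng.normal(size=80)
bias = 0.0

def predict_up(features: np.ndarray) -> bool:
    """Single inference: linear score through a sigmoid, threshold at 0.5."""
    score = features @ weights + bias
    prob_up = 1.0 / (1.0 + np.exp(-score))
    return bool(prob_up > 0.5)

# Time 1,000 inferences on fresh random snapshots.
snapshots = rng.normal(size=(1000, 80))
start = time.perf_counter()
for snap in snapshots:
    predict_up(snap)
elapsed = time.perf_counter() - start
mean_latency_ms = 1000.0 * elapsed / len(snapshots)

# For a 100ms decision window, mean per-inference latency must stay well
# below 100ms; a linear model clears this easily, while a deep network's
# forward pass on the same hardware may not.
```

The point of the exercise: before picking a model for a 100ms horizon, measure its inference latency the same way and reject anything that eats a meaningful fraction of the window.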