
DOI: 10.1055/a-2705-8654
Predicting Steam Processing Degree of Prepared Radix Rehmanniae (Shudihuang) Using Machine Learning
Authors
Funding The work was supported by China State Institute of Pharmaceutical Industry Co., Ltd. through the Independent Project Grant (Grant No. 2022ZX016).

Abstract
The classic preparation of prepared Radix Rehmanniae (PRR), “steaming and drying (SD) for several cycles (generally nine times, SD9),” is the gold-standard method in traditional processing theory. However, the “optimal SD cycle” remains controversial, and there has been no efficient way to identify the processing degree of PRRs. This study aims to determine the optimal processing conditions under which PRR approaches SD9 quality and to establish validated models that identify the processing degree of unknown PRR samples. PRR-SD1–9 samples were prepared under 1 to 9 SD cycles. A spectrophotometer and a camera were used for color and gloss assessment. The chemical changes during processing were detected by high-performance liquid chromatography and liquid chromatography coupled with mass spectroscopic (LC-MS) technique. Principal component analysis of the LC-MSE data showed separation of PRRs with different processing degrees, which led to the use of random forest (RF) for model training. The changes in both appearance and chemical composition were obvious before, but slight after, 3–5 SD cycles. Two predictors based on the RF classifier proved valid for identifying steaming time over a wide range (0–78 hours), with an error rate of 0 in blind verification. Ten commercial samples were then identified as PRRs processed for 1–2 SD cycles. Our finding that 3 to 5 processing cycles can yield PRR comparable to PRR-SD9 agrees with the knowledge documented in QianJinYiFang. Our models can be good tools for quality control in PRR manufacture and supervision.
Keywords
prepared Radix Rehmanniae - steaming and drying for 9 cycles - machine learning - random forest
Introduction
As a widely used Chinese medicinal herb, the roots of Rehmannia glutinosa (Radix Rehmanniae, RR, “Dihuang”) are commonly used in traditional Chinese medicine (TCM) prescriptions to treat anemia, irregular menstruation, renal failure, and other diseases.[1]
There are three processing forms of RR used as decoction pieces, namely, fresh Radix Rehmanniae, dried Radix Rehmanniae (DRR), and prepared Radix Rehmanniae (PRR, Shudihuang).[2] In ancient China, the PRR preparation method of water steaming was first mentioned in the Synopsis of the Golden Chamber (JinKuiYaoLue, Eastern Han Dynasty, A.D. 25–220) and further detailed in QianJinYiFang (QJYF, Tang Dynasty, A.D. 682). QJYF recorded that the PRR prepared by steaming rice wine (RW)-immersed DRR 3 to 5 times could be analogous to that prepared by the 9 times steaming process (a more ancient method). Later, in the Ming and Qing Dynasties (A.D. 1368–1911), the process of “RW (with or without Fructus Amomi) immersion, together with steaming and drying for nine cycles (SD9),” by which PRR with the desired appearance and taste (“black as lacquer, sweet as maltose”) can be obtained, was popularly applied and widely recorded in medicinal works, such as the Compendium of Materia Medica (BenCaoGangMu, A.D. 1552–1578), BenCaoPinHuiJingYao (A.D. 1505), BenCaoBeiYao (A.D. 1694), and others. From ancient times to the present, there is a host of records about processing methods from DRR to PRR, which differ slightly.[1] For cost reduction, there has been a trend toward fewer steaming cycles and shorter steaming times in PRR manufacturing nowadays. According to the modern processing method described in the Chinese Pharmacopoeia (CP), PRR is prepared by steaming DRR mixed with water (or RW) or stewing DRR mixed with RW, but the number of processing cycles and the steaming time are not defined.
The PRR by SD9 is monographed in only a few local specifications, e.g., Henan province specifications for processing of TCM (2022 edition).[3] The specific number of processing cycles is still controversial.[4] The steaming degree imparts unique characteristics to PRRs and influences the quality and final clinical effectiveness of the PRR produced.[1] Thus, it is imperative to determine the optimal number and duration of steaming cycles that bring PRR close to the quality of the traditionally made PRR-SD9.
However, accurately identifying the steaming degree of PRR remains a problem for industry and regulatory agencies, which adds challenges to the quality assurance of PRR products. The liquid chromatography coupled with mass spectroscopic (LC-MS) technique has been used to generate an enormous amount of data about the RR samples, and multivariate statistical analysis (MSA) has been successfully applied to classify the samples into DRR and PRR groups and to determine which compounds are correlated with the PRR property.[5] However, the MSA models could not identify the exact processing degree of a new PRR unknown, which needs a more powerful data analysis with a larger number of variables (more complex datasets). Within this context, machine learning (ML) is a promising alternative to address this issue.[6] ML techniques have been successfully used in conjunction with LC-MS for TCM quality control (QC)[7] but have only been used once to predict the steaming time (0–15 hours) of PRR,[8] with an error rate of 8% (two misidentified samples out of 24 blind samples). That study did not cover predictions for intensive steaming degrees (more than 15 hours), and its dataset included only the visualized oligosaccharide distribution.[8] As the steaming time increases, the change in the oligosaccharide profile is initially significant and later slight.[9] [10] [11] This loss of relevant features hampers the distinction between PRR samples with deeper steaming degrees.
Therefore, the present study aims to (1) determine the optimal processing conditions under which PRR approaches the SD9 quality, by an overall assessment of both appearance and chemical composition, and (2) establish validated models to identify the processing degree of new uninvestigated PRRs by combining metabolomic studies, MSA, and ML algorithms. The best processing degree (or the most feasible endpoint) that matched the quality of traditionally made PRR-SD9 is discussed. We also introduce random forest (RF) methods with LC-MS-based metabolomic datasets that were able to discriminate a broad range of PRR processing degrees. This study could thus guide modern manufacturing processes of PRR preparation and provide a useful tool for assessing PRR quality.
Material and Methods
Chemicals and Reagents
The LC-MS grade acetonitrile and formic acid were purchased from Macklin (Shanghai, China), and the high-performance liquid chromatography (HPLC)-grade methanol and acetonitrile were acquired from Adamas (Shanghai, China). The additives (phosphoric acid and ammonia) for the mobile phase were obtained from Sinopharm Chemical Reagent (Shanghai, China). The ultra-pure water was in-house prepared by a Milli-Q Integral 5 system (Millipore, Massachusetts, United States). Rehmannioside D (112063–202103, purity 94.2%), catalpol (110808–202313, purity 99.6%), sucrose (111507–202105, purity 99.8%), raffinose (16042, purity 83.6%), stachyose (112031–202203, purity 94.9%), melibiose (13549, purity ≥ 98.0%), manninotriose (16187, purity ≥ 98.0%), fructose (100231–202008, purity 99.9%), mannitol (100533–202207, purity 99.3%), and glucose (110833–202109, purity 99.9%) were purchased from Nature Standard Co., Ltd. (Shanghai, China). The yellow RW (20231028348, 10.5% AbV, 40 mg/mL of glucose) was obtained from Anhui Matouqiang Wine Co., Ltd. (Anhui, China).
Prepared Radix Rehmanniae Samples Preparation and Collection
A total of 80 PRR samples ([Table 1], [Fig. 1]) with different processing excipients and different steaming degrees were prepared from three batches of DRRs, which were collected in Mengzhou county (Henan, China). The preparation procedure is detailed as follows.
Abbreviations: DRR, dried Radix Rehmanniae; PRR, prepared Radix Rehmanniae; RW, yellow rice wine; FA, Fructus Amomi; SD, steaming and drying.


Prepared Radix Rehmanniae (with Rice Wine) by SD for 1–9 Cycles
The DRR samples were cleaned, dried, and divided into four groups by size (#1, ∼13 roots per 100 g; #2, ∼18 roots per 100 g; #3, ∼28 roots per 100 g; #4, ∼40 roots per 100 g) to prepare PRRs under the same processing conditions. The PRR samples (36 batches) with SD for different cycles were obtained as follows: DRR (100 g) was mixed with RW (35 mL), thoroughly moistened for 24 hours, put in the glass dish, water-steamed for 12 hours, and dried at 50°C to 80% of dryness, to obtain the prepared Radix Rehmanniae with SD for 1 cycle (PRR-RW-SD1). Meanwhile, the oily juice was collected in the glass dish. PRR-RW-SD1 was mixed with the juice, thoroughly moistened for 24 hours, water-steamed for 12 hours, and dried at 50°C to 80% of dryness to obtain the PRR-RW-SD2. PRR-RW-SD2 was mixed with the collected juice, thoroughly moistened for 24 hours, water-steamed for 12 hours, and dried at 50°C to 80% of dryness to obtain the PRR-RW-SD3. The PRRs-RW-SD4 to 9 were further prepared in the same way, except that the times to steam PRRs-RW-SD4 to 6 and PRRs-RW-SD7 to 9 were 8 and 6 hours, respectively. After the last steaming, the oily and lustrous roots were sliced and dried at 50°C for 9 hours to obtain the PRR-RW-SD1 to 9 samples, respectively.
Another five batches of PRRs with RW by 5 SD cycles were also repeatedly prepared from two batches of DRR.
Prepared Radix Rehmanniae (without Rice Wine) by SD for 1 to 9 Cycles
All the procedures for PRRs-SD1–9 (without RW) preparations were the same as those for PRRs-RW-SD1 to 9 preparations, except that the RW (35 mL) was replaced with drinking water (35 mL) as a processing excipient. After the last steaming, the oily and lustrous roots were sliced and dried at 50°C for 9 hours to obtain the PRR-SD1–9 samples (36 batches), respectively.
Prepared Radix Rehmanniae with Fructus Amomi by SD
According to the Henan province specifications for processing of TCM, the PRR-Fructus Amomi (FA) was prepared as follows: the DRR samples (100 g) were cleaned, dried at 55°C for 45 hours, mixed with RW (50 mL) and FA (0.9 g), and thoroughly moistened for 24 hours. The moistened roots were put in the glass dish and steamed for 48 hours. The steamed roots were sliced and dried at 50°C for 9 hours to obtain the PRR-FA samples (n = 3).
In addition, 10 commercial samples with unknown steaming degrees purchased from vendors were detailed in [Table 1].
The Determination of the Optimal Steaming Times by Apparent and Chemical Assessment
Color and Gloss Determination
The mean L*ab values (400–700 nm, 10-nm interval, n = 12) of PRR samples were acquired by an NS800 spectrophotometer (3nh Global, China), which uses a 45°/0° geometrical optical structure complying with CIE No. 15 and GB/T 3978 standards.[12] The DRR sample was used as a reference. The key parameters were set as follows: the light source was D65; the observer's angle was 10 degrees; the color space was CIE LAB and LCh; and the color index was CIE 1976. The reflection ratios (%) of PRRs at each wavelength were obtained from L*ab values by the SQC8 color management control system (3nh Global, China).
The images of PRR samples were acquired by a Canon EOS M100 camera. The resolution, horizontal resolution, vertical resolution, aperture value, exposure time, ISO speed, and focal length were 6,000 × 4,000, 180 dpi, 180 dpi, f/6.3, 1/40 seconds, ISO-3200, and 45 mm, respectively. An object selection tool (Adobe Photoshop, Beta) was applied to select the analyzing area in images and view the brightness values[13] of PRR samples. The mean brightness value of all the analyzing areas was calculated for gloss assessment.
Quantification of Iridoids and Mono/Di/Oligosaccharides by High-Performance Liquid Chromatography
The sample pretreatment and HPLC analyses for the measurement of catalpol and rehmannioside D were conducted using the methods of the CP,[2] which are briefly described in the [Supporting Information] (available in online version). Sample solution preparation, mono/di/oligosaccharide (sucrose, raffinose, stachyose, melibiose, manninotriose, fructose, mannitol, and glucose) quantification, and method validation are also detailed in the [Supporting Information] (available in online version).
The Steaming-Induced Hydrolysis of Oligosaccharides
A parallel experiment was performed to understand the mechanism of steaming-induced transformation of saccharides. The chemical changes were detected before (SD0), during (SD1, 2, 3, and 5), and after (SD9) steaming the pure saccharide compounds for nine cycles. Briefly, the weighed-in quantities of sucrose, raffinose, and stachyose were calculated with reference to their exact quantities in the DRR. The solutions of stachyose (n = 3), raffinose (n = 3), and sucrose (n = 3) were generated by dissolving approximately 15.0, 4.0, and 12.5 mg of the pure compounds, respectively, in 50.0 mL of water using glass vessels. These solutions were then steamed for nine cycles with the same process (e.g., time) as used for PRR preparation (section 2.2). At intervals of steaming cycles, aliquots of solutions were removed for HPLC analysis, and the same volume of water was added back. All solutions were weighed with the vessel and sampled after cooling to room temperature.
Liquid Chromatography Coupled with Mass Spectroscopy Analysis
The preparation of the sample solution (Step 1 in [Fig. 2]) is detailed in the [Supporting Information] (available in online version). LC-MS analyses for small molecules were performed on a Waters ACQUITY Ultra Performance Liquid Chromatographic (UPLC) system, hyphenated to a Waters Xevo G2-XS-quadrupole time-of-flight (QTOF) MS. The separation was achieved on a Waters ACQUITY UPLC HSS T3 column (50 mm × 2.1 mm, 1.8 μm) at 35°C with a mobile phase of water with 0.1% (v/v) formic acid (phase A) and acetonitrile (phase B) under the following conditions: 0 to 2 minutes, 1% B; 2 to 4 minutes, 1 to 9% B; 4 to 10 minutes, 9 to 29% B; 10 to 12 minutes, 29 to 48% B; 12 to 27 minutes, 48 to 100% B; 27 to 33 minutes, 100% B; 33 to 33.1 minutes, 100 to 1% B; and 33.1 to 34 minutes, 1% B. The flow rate was 0.3 mL/min, and the injection volume was 10 μL. The MS was operated using an electrospray ionization source in negative ion mode. The MS parameters in MSE mode were set as capillary voltage 2.0 kV, source temperature 100°C, desolvation temperature 250°C, cone gas flow 50 L/h, desolvation gas flow 600 L/h, and cone voltage 40 V. All data were collected in the MSE continuum mode and acquired by MassLynx 4.1 software. Mass accuracy of the parent ions and major fragments was limited to within 5 ppm. Leucine enkephalin (1 ng/mL) was used for the lock mass ([M − H]−, m/z 554.2615) at a flow rate of 5 μL/min. The collision energy ranged from 30 to 50 eV for the high-energy function, and the scan time was 0.3 seconds. The mass range was 50 to 1,500 Da.


Depending on the untargeted metabolomics experimental design, a QC sample was prepared by mixing equal volumes (50 μL each) of all samples intended for the metabolomics study. Before initiating the injection sequence, the QC sample was run 10 times to condition the system. Subsequently, a random sequence of study samples was injected, with a QC sample inserted at every 5-sample interval to monitor system stability.
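The run order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_injection_sequence`, the sample IDs, and the random seed are hypothetical and do not come from the study; only the 10 conditioning QC injections and the QC after every fifth sample follow the text.

```python
import random


def build_injection_sequence(sample_ids, n_conditioning=10, qc_every=5, seed=0):
    """Return a run order: conditioning QC injections first, then the
    shuffled study samples with one QC injection after every `qc_every` samples."""
    rng = random.Random(seed)          # fixed seed only so the sketch is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)              # random sequence of study samples

    sequence = ["QC"] * n_conditioning  # condition the system before the study samples
    for position, sample in enumerate(shuffled, start=1):
        sequence.append(sample)
        if position % qc_every == 0:    # monitor stability at every 5-sample interval
            sequence.append("QC")
    return sequence


# Hypothetical sample IDs, for illustration only.
samples = [f"PRR-{i:02d}" for i in range(1, 13)]
order = build_injection_sequence(samples)
```

Tracking the QC injections in such a sequence (for example, their retention-time drift or total intensity) is what allows system stability to be judged over the run.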
Data Preprocessing and Preparation
Progenesis QI (Waters) was used for LC-MS data preprocessing (Step 2 in [Fig. 2]), including retention time (RT) alignment, peak picking, and normalization. Peak alignment was performed by taking the pooled QCs as the reference. Isotope and adduct deconvolution were applied to reduce the overlap in data features. All data were normalized to the summed total ion intensity per chromatogram, and a table with peaks (each with m/z, RT, and normalized abundance values) was obtained for each experiment. The experiments were performed in two replicates for each of the 85 PRRs.
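As a rough illustration of normalizing each chromatogram to its summed ion intensity, the sketch below scales every row of a peak table so that it sums to 1. It is not the Progenesis QI implementation; the function name `normalize_to_total_intensity` and the toy intensities are assumptions made for this example.

```python
import numpy as np


def normalize_to_total_intensity(peak_table):
    """Scale each row (one injection) so that its peak intensities sum to 1."""
    peak_table = np.asarray(peak_table, dtype=float)
    totals = peak_table.sum(axis=1, keepdims=True)   # summed intensity per injection
    if np.any(totals <= 0):
        raise ValueError("every injection needs a positive total intensity")
    return peak_table / totals


# Toy peak table (rows: injections, columns: m/z_RT features); values are made up.
raw = np.array([[200.0, 300.0, 500.0],
                [20.0, 30.0, 50.0]])
normalized = normalize_to_total_intensity(raw)
```

In this toy case the second injection is a tenfold weaker copy of the first, so both rows become identical after normalization, which is the purpose of the step: removing differences in overall signal level before statistical analysis.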
Then, the resultant data matrices were introduced to EZinfo 2.0 software for Principal Component Analysis (PCA, an unsupervised learning method), to preliminarily assess groupings among the samples according to the steaming intensity level. Next, a Partial Least Squares Discriminant Analysis (PLS-DA) was used to select feature peaks with the Variable Importance in Projection scores greater than 1 (VIP > 1).[14]
Two datasets, the full features (the unique m/z_RT pairs) versus corresponding “normalized abundance” (1, ALL) and the features of VIP > 1 versus corresponding “normalized abundance” (2, VIP), were then fed into the ML models for classification training.
Machine Learning
Classification model development was performed using supervised ML (Step 3 in [Fig. 2]), where the dataset has been explicitly labeled or classified, that is, each data point is known to belong to a given category. In supervised learning, the process involves learning from labeled data (training data), and it creates a model that maps inputs (features) to outputs with high accuracy on previously unseen data (blind verification data) during the data validation phase.[15] Four algorithms, namely Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and RF, were used, and the results were compared to obtain the model with the highest accuracy. These algorithms were selected considering their respective advantages.
ML algorithms are better suited to datasets in which the number of features exceeds the number of samples.[16] PCA, linear discriminant analysis (LDA), and regularization were used for dimensionality reduction, and SelectKBest was used for feature extraction, to simplify the data and enhance model performance. For example, the SelectKBest class in scikit-learn, categorized as a filter-based feature selection method,[17] implements a two-stage procedure for identifying top-performing features from a dataset: relevance metric computation via statistical hypothesis testing (e.g., mutual information for capturing nonlinear dependencies, chi-square for testing feature-target independence), and subset selection through thresholding on the computed scores. For classification problems, the mutual_info_classif variant is preferred, which estimates the mutual information ([Eq. 1]) between features X and discrete targets Y.[18] This approach effectively identifies predictive features while maintaining computational efficiency through empirical probability estimation from sample data and avoidance of the high-dimensional covariance matrix computations required by parametric methods. We also set up a custom feature selector using importance_threshold, where num_features_to_select was set to 80 to calculate the feature importance and select the top 80 features for model training. All modeling was performed using the Python programming language (version 3.12), the Scikit-learn ML package (version 1.5.1), and the PyCharm IDE (version 2024.1.3).
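For reference, the standard definition of mutual information between a feature X and a discrete target Y is I(X; Y) = Σ p(x, y) log[p(x, y) / (p(x) p(y))], summed over all value pairs (x, y). A minimal sketch of filter-based selection with SelectKBest and mutual_info_classif follows. The synthetic data, k = 20, and the fixed random_state are assumptions chosen for illustration and are not the settings or data of this study.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a peak table: 90 samples, 200 features, 3 classes.
rng = np.random.default_rng(0)
n_samples, n_features = 90, 200
y = np.repeat(np.arange(3), n_samples // 3)
X = rng.normal(size=(n_samples, n_features))
X[:, :5] += 4.0 * y[:, None]          # only the first five features depend on the class

# Stage 1: score every feature by its estimated mutual information with y.
# Stage 2: keep the k features with the highest scores.
score_func = partial(mutual_info_classif, random_state=0)  # fixed seed for reproducibility
selector = SelectKBest(score_func=score_func, k=20)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)          # column indices that were kept
```

The per-feature scores are available in `selector.scores_`, which can be inspected to see how sharply the informative features separate from the rest.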


Except for the data in the blind verification dataset, the remaining data were randomly divided into a training set (80%) and a testing set (20%) for all classifiers. To evaluate the performance of the models for all classifiers, the k-fold cross-validation method was employed, with K set to either 3 or 5. Grid search was applied to explore the optimal parameters for the model. To further evaluate the model's performance, indicators such as average cross-validation score, accuracy, precision, recall, F1 score, and confusion matrix were adopted for a comprehensive assessment.[19] We selected the optimal classification model based on its error rate in predicting the degree of PRR processing when the blind verification dataset was used as input.
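The evaluation workflow described above (80/20 split, k-fold cross-validation, grid search, and summary metrics) can be sketched as follows. The synthetic data, the SVM parameter grid, the stratified splitting, and the weighted averaging of the metrics are assumptions made for this example; the text does not specify them.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 90 samples, 10 features, 3 classes.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 10))
X[:, :3] += 4.0 * y[:, None]

# 80% training set, 20% testing set (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Grid search with 5-fold cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}   # hypothetical grid
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out testing set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0)
cm = confusion_matrix(y_test, y_pred)
mean_cv_score = search.best_score_     # average cross-validation score of the best setting
```

Keeping the blind verification samples out of both the split and the grid search, as the text describes, is what makes the final error rate an honest estimate.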
Finally, the built models were applied to identify the processing degree of commercial PRR samples.
Results and Discussion
Determination of the Optimal SD Cycles Based on Color and Gloss Analyses
A traditional way to determine the endpoint of the PRR processing procedure was visual assessment by skilled professionals of the “black as lacquer” appearance of PRR.[1] Visually, the color change from DRR to PRR was obvious, but the PRRs from different SD cycles were hardly distinguishable ([Fig. 3A]). A colorimeter and a digital camera were used in this study to provide a more objective description of PRRs. As shown in [Fig. 3B], [C], the reflection ratio of PRR-RW across wavelengths decreased significantly, by 50.7 to 56.9% (p = 0.007–0.017, PRR-RW-SD4 vs. PRR-RW-SD5), at the 5th SD cycle and fluctuated only slightly thereafter (percent decrease ranging from −7.4 to 29.1% for PRR-RW-SD6–9). The mean brightness of PRR with RW and PRR without RW increased by 12.4% (p < 0.05, PRR-RW-SD4 vs. PRR-RW-SD5) and 5.4% (p > 0.05, PRR-SD4 vs. PRR-SD5), respectively, at the 5th SD cycle and fluctuated only slightly thereafter (percent increase ranging from −5.2 to 2.9% for PRR-RW-SD6–9 and from −3.7 to 1.7% for PRR-SD6–9). In addition, mixing the roots back with the oily juice was found to be necessary to enhance the glossiness of PRR. The PRRs prepared without RW showed a higher brightness than the PRRs prepared with RW at the same SD cycle ([Fig. 3C]). Consequently, the changes in color and brightness were obvious before but slight after the 5th SD cycle.


Determination of the Optimal SD Cycles Based on Sugar and Iridoid Contents
For the analysis of DRR and PRRs, two validated HPLC-ELSD methods (detailed in [Supporting Information], [Table S1] and [Fig. S1], available in online version) were used for the quantitative determination of fructose, mannitol, glucose, sucrose, melibiose, raffinose, manninotriose, and stachyose, and two HPLC-UV methods recorded in the CP were used for the quantitative determination of catalpol and rehmannioside D.
As shown in [Fig. 3D], sucrose, raffinose, and stachyose drastically decreased by 76.4, 68.7, and 70.2%, respectively, after the 1st SD cycle and were then completely converted to glucose and fructose, melibiose and fructose, and manninotriose and fructose, respectively ([Fig. 3E]), by the 2nd or 3rd SD cycle. Glucose, fructose, and manninotriose contents increased markedly, by 3.8-, 5.0-, and 9.3-fold, respectively, from DRR to PRR-RW-SD2 and changed only slightly during the further stages (from PRR-RW-SD3 to PRR-RW-SD9). As a by-product of the steaming process, melibiose also showed a pronounced increase during the first two SD cycles and remained at a steady state during the last seven SD cycles. The sugar conversion, with breaking of only the fructosidic bond during the steaming process, was confirmed by a parallel experiment with pure di/oligosaccharide compounds. The hydrolysis of the galactosidic bond, which was speculated by Zhou et al., was not found in this study.[9] Only the fructose side units were removed from di/oligosaccharides because of the high reactivity of the furanosidic bonds ([Fig. 3E]). The presence of bond opposition or angle strain in the 5-membered furanose ring results in easier hydrolysis of the glycosidic bond in furanosides than in pyranosides.[20]
Catalpol decreased quickly and disappeared completely from the 1st to the 2nd SD cycle, whereas rehmannioside D decreased gradually during 9 SD cycles. Catalpol degraded from hydrolysis of the glycosidic bond, ring-opening rearrangement of the hemiacetal moiety, and dehydration of the 6-OH alcohol, subsequently,[21] [22] to form furans and pyrans. Unlike that of catalpol, substitution of the glycosyl group at C-5 of rehmannioside D could inhibit dehydration of the 6-OH alcohol, resulting in the suppression of the degradation rate. Notably, the markers specified in the monograph “PRR” of the CP involve rehmannioside D (specified at ≥0.050%, m/m),[2] which became lower than 0.050% in several PRR-SD5–9 samples (2 PRR samples prepared from 2 batches of RR, data not shown) in this study.
Consequently, the changes in sugar and iridoid contents were obvious before but slight after the 3rd SD cycle. PRRs prepared from 3 to 5 SD cycles would be suggested both to mimic the traditional processing method and to meet the specified criteria.
Determination of the Optimal SD Cycles Based on Untargeted Metabolomic Analyses
Sample solutions contain rich chemical information and can reflect the overall changes in small molecules during the SD processing. The LC-QTOF-MSE data were processed by Progenesis QI software.[23] An unsupervised PCA[24] model obtained from the LC-MS data of all PRR-RW-SDn or all PRR-SDn samples revealed the general structure of the complete dataset, in which the first two PCs cumulatively accounted for 63.1 or 64.8% of the total variation, with PC1 accounting for 39.4 or 48.3% of the variance, discriminating PRRs with different SD cycles ([Fig. 4]). [Fig. 4] revealed two trends in the metabolomic profile during PRR processing, both with and without yellow RW. During the first four SD cycles, the small-molecule profiles of PRRs differed markedly between SD cycles. During the last five SD cycles, multiple replicates of the PRRs-RW-SD5–9 and the PRRs-SD5–9 exhibited similar metabolomic profiles (red square in [Fig. 4]). Consequently, the changes in small-molecule profiles were marked before but slight after the 4th SD cycle.


Thus, this study demonstrated that PRR by 3 to 5 SD cycles could reach the quality of PRR-SD9 based on the physical and chemical properties.
RR is the typical medicinal herb with the characteristic of “different clinical uses before and after processing.” The previously published data showed a better proliferation effect of polysaccharides from PRR-SD9 (6 hours × 9) than those from PRR-SD1 (12 hours × 1) on rat ovarian granulosa cells.[25] A study also reported that the changes in polysaccharides were obvious before but slight after the 5th SD cycle.[26] Thus, the equivalence of bioactivity between PRR-SD9 and PRRs-SD3–5 could be speculated. However, further research, including a comparative study on pharmacological action (or even clinical efficacy) between PRR-SD9 and PRRs-SD3–5, is needed to confirm the equivalence.
Prediction of Prepared Radix Rehmanniae Processing Degree by Machine Learning
Although known samples (e.g., PRRs-SD1–5) could be classified well from LC-MSE data with PCA, no significant difference could be observed among intensively steamed PRRs (e.g., PRRs-SD6–9). A more powerful ML model that can distinguish samples with a deep steaming degree and even predict the processing degree of unknown samples is therefore needed.
Our model training process strictly followed the established algorithm framework. As a result, we successfully developed two RF models that can accurately predict the PRR processing degree based on input data and can provide a useful tool for the PRR processing optimization and QC.
Data Processing Summary
The LC-MSE data must first be preprocessed before they can be incorporated into an ML approach. Two datasets preprocessed by Progenesis QI[27] were used to create the training, testing, and blind verification sets. The two datasets were 1 (ALL), processed data including all MS peaks (a total of 15,847 peaks) with relative abundance, RT, and m/z; and 2 (VIP), processed data including the MS signals (a total of 2,463 peaks) responsible for feature differentiation (VIP > 1 from PLS-DA analysis)[28] with relative abundance, RT, and m/z.
Machine Learning Models Selection, Training Optimization, Blind Verification, and Application
First, the three preselected ML algorithms, namely, LR, DT, and SVM, were trained and tested to evaluate the accuracy of prediction using both datasets 1 (ALL) and 2 (VIP) as input. PCA, LDA, and regularization were used for dimensionality reduction of the data to avoid overfitting, which is a common problem in ML and deep learning.[19] [29] As shown in [Table 2], the evaluation of the various models for identifying the PRR processing degree gave preliminary performance across the key metrics. The accuracies were 70% for ALL-SVM, 70% for VIP-SVM, 67% for ALL-LR, 70% for VIP-LR, 79% for ALL-DT, and 74% for VIP-DT. However, none of these models achieved good accuracy on the blind verification set ([Supporting Information], [Fig. S2], available in online version). Among PRRs-SD1–5, 83% of the samples were correctly classified, but among PRRs-SD6, 8, and 9, 72% of the samples were misclassified as PRRs-SD7(8), 6(7), and 8, respectively.
Subsequently, the RandomizedSearchCV algorithm was used to optimize the parameters of the DT model and maximize its predictive power. The selected “criterion” of “entropy” indicated a more effective information gain metric for our dataset. The “max_depth” was set to 5, “max_features” to 1966, “min_samples_leaf” to 5, and “min_samples_split” to 4. The model, incorporating RandomizedSearchCV and DT, demonstrated better performance, with an accuracy of 85% for the training set. However, the accuracy for PRR-SD6–9 prediction could not exceed 90% for the blind verification set. The DT model was thus suitable for classifying PRRs with a lighter processing degree, but not PRRs with intensive processing degrees.
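A hedged sketch of tuning a decision tree with RandomizedSearchCV over the same five hyperparameters is given below. The candidate values, the number of sampled settings (n_iter), the 3-fold split, and the synthetic data are illustrative assumptions and are not the search space used in this study.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 90 samples, 40 features, 3 classes.
rng = np.random.default_rng(4)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 40))
X[:, :5] += 4.0 * y[:, None]

# Hypothetical search space covering the hyperparameters named in the text.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 8, None],
    "max_features": [10, 20, None],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 4, 8],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=15,          # number of random parameter settings to try
    cv=3,               # 3-fold cross-validation for each setting
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_   # the sampled setting with the best mean CV score
best_tree = search.best_estimator_  # tree refitted on all data with that setting
```

Unlike an exhaustive grid search, the randomized search evaluates only a sample of the combinations, which keeps the cost manageable when the number of candidate values is large.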
Given the limitations of DT, we utilized a tree-based RF method, where many DTs are calculated based on the original dataset, and each of them predicts a classification.[30] Indeed, the results of model development revealed the superiority of the RF models in estimating the degree of PRR processing in this study. First, the SelectKBest feature selection was employed in the ALL-RF model, with the mutual information classification specified as the scoring function. Then, the top k = 100 features with the highest scores from the original feature set were selected for model training. In the evaluation results, the model trained with a dataset based on all features (ALL-RF) showed much higher values, with Average cross-validation score, Accuracy, Precision, Recall, and F1 score values of 0.93, 0.96, 0.98, 0.96, and 0.96, respectively ([Fig. 5A]). [Fig. 5B] demonstrates the result of using ALL-RF to classify the training set in the confusion matrix. A total of 93% of the reference samples were classified correctly in groups of PRRs with different processing degrees. Only two PRR-SD8 samples were misclassified in the group of PRR-SD9.
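The combination of SelectKBest (mutual information, k = 100) with an RF classifier might be sketched as below. Placing the selector inside a Pipeline, so that features are re-selected within each cross-validation fold, is a design choice of this sketch and is not stated in the text; the synthetic data, the number of trees, and the 3-fold split are likewise assumptions.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 90 samples, 300 features, 3 classes.
rng = np.random.default_rng(2)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 300))
X[:, :10] += 4.0 * y[:, None]

pipeline = Pipeline([
    # Keep the 100 features with the highest mutual information scores.
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=100)),
    # Random forest: many decision trees, each voting for a class.
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X, y, cv=cv)   # accuracy per fold

pipeline.fit(X, y)
n_kept = int(pipeline.named_steps["select"].get_support().sum())
```

Selecting features inside each fold avoids leaking information from the validation samples into the feature scores, which would otherwise make the cross-validation score look better than it is.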


Another RF model, trained with the VIP > 1 (from PLS-DA) dataset (VIP-RF), was also built, in which the top 80 features were selected for model training. The identification of processing degree also achieved impressive results, with an Average cross-validation score of 0.93, an Accuracy of 0.93, a Precision of 0.96, a Recall of 0.93, and an F1-score of 0.93 ([Fig. 5A]). [Fig. 5C] shows the result of using VIP-RF for processing degree identification as a confusion matrix. Again, 93% of the reference samples were classified correctly in their groups. Only one PRR-SD6 sample and one PRR-SD8 sample were misclassified into the PRR-SD7 group.
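The custom importance-based selector is not described in detail here. As a stand-in, the sketch below uses scikit-learn's SelectFromModel with an RF estimator to keep the 80 features with the highest impurity-based importance; the synthetic data and the number of trees are assumptions, and this is not the authors' selector.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 90 samples, 200 features, 3 classes.
rng = np.random.default_rng(3)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 200))
X[:, :5] += 4.0 * y[:, None]

forest = RandomForestClassifier(n_estimators=200, random_state=0)

# threshold=-np.inf disables the importance cutoff, so max_features alone
# decides how many features are kept (the 80 most important ones).
selector = SelectFromModel(forest, threshold=-np.inf, max_features=80)
X_top = selector.fit_transform(X, y)
top_idx = selector.get_support(indices=True)           # indices of the kept features
importances = selector.estimator_.feature_importances_  # importance of every feature
```

A model can then be trained on `X_top`; ranking by importance is an embedded selection approach, in contrast to the filter-based SelectKBest scoring used for the ALL-RF model.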
In the blind verification procedure, a total of 15 PRR samples with different processing degrees (including PRR-SD1–9, PRR-RW-SD1–9, and PRR-FA) were blindly prepared for the ALL-RF and VIP-RF model verification. Both models achieved 100% accuracy with an error rate of 0, proving more effective and precise than a reported RF model[8] that could only distinguish PRR samples with a light steaming degree (<18 hours) and also misidentified two samples in the verification procedure. The features in that model were derived solely from raffinose family oligosaccharides, which were not sufficient to discriminate PRR with a specific steaming degree (especially a deep one) from all other PRRs. This was confirmed by the HPLC results showing no or only slight changes in sugar contents after a deeper processing procedure ([Fig. 3D]), as well as by other published reports.[9] [10] [11]
Finally, we applied the ALL-RF and VIP-RF models to identify the processing degree of 10 commercial samples obtained from the market. As can be seen in [Fig. 5D], seven batches were identified as 0 to 12 hours steamed samples (equivalent to PRR-SD1), two batches as 12 to 24 hours (equivalent to PRR-SD2), and one batch as 0 to 24 hours (equivalent to PRR-SD1,2). The results reflect the fact[31] that most PRRs on the market are not steamed or processed intensively enough and cannot reach the quality of traditionally made PRR. PRR is a typical negative example: it is usually manufactured with a simplified processing procedure, or the prescribed procedure is not implemented at all.[31] In this sense, methodologies that give a more complete picture of the features of traditionally made PRRs may play a significant role in QC and standard establishment. When establishing standards for TCM decoction pieces, it is essential to study the experience, techniques, and traditions of processing, and then to find the key processing-related factors that affect the quality of decoction pieces.[31] [32] State-of-the-art approaches, such as LC-MS analysis combined with ML, could be a good tool for bridging the gap between tradition and modernization in TCM.
Conclusion
Our results demonstrate dynamic changes in the color and gloss, sugar and iridoid contents, and metabolomic profile of PRRs throughout the nine processing (steaming and drying) cycles. All these physical and chemical characteristics tend toward a steady state after the 3rd to 5th SD cycles, which could therefore be the optimal number of cycles for approaching the quality obtained by the traditional 9-SD-cycle procedure. Notably, our view that PRR made with 3 to 5 SD cycles is qualitatively equivalent to PRR made with 9 SD cycles is in good agreement with the ancient record in QJYF (Tang Dynasty, A.D. 682). Understanding these dynamics could lead to improved processing strategies, enhancing both the efficacy and quality of PRR.
Moreover, this study illustrates the potential of LC-MSE data combined with RF algorithms for identifying the processing degree of PRR unknowns. Chemical signatures of PRRs with different processing degrees, acquired by LC-MSE analysis, can then be subjected to MSA using predictors based on two RF models (ALL-RF and VIP-RF) to predict the processing degree of PRR unknowns at an error rate of 0, surpassing the accuracy achieved by previously reported models. As an alternative to steaming degree determination by processing experts based on sensory characteristics (color and flavor), our models can be good tools for QC in PRR manufacture and supervision, given their high capacity and accuracy in identifying the processing degree of PRR unknowns over a considerably wider range of steaming time (0–78 hours).
Consequently, this work could mark a step from traditional to controlled processing and may even open perspectives for industrialization.
Supporting Information
This section includes the experimental procedure for quantitative analysis of rehmannioside D and catalpol in DRR or PRR; validation of the HPLC-ELSD method for quantification of 8 sugars in DRR or PRR; and sample preparation for LC-QTOF-MSE analysis.
Also included are the method validation for the quantitative determination of fructose, mannitol, glucose, sucrose, melibiose, raffinose, manninotriose, and stachyose ([Table S1], available in online version only); chromatographic profiles of the 3 monosaccharides and the 5 di/oligosaccharides ([Fig. S1], available in online version only); and a box plot of the blind verification accuracy obtained by the ML algorithms for identifying the processing levels of PRR ([Fig. S2], available in online version only).
Conflict of Interest
None declared.
References
- 1 Li M, Jiang H, Hao Y. et al. A systematic review on botany, processing, application, phytochemistry and pharmacological action of Radix Rehmnniae. J Ethnopharmacol 2022; 285: 114820
- 2 National Pharmacopoeia Commission. Pharmacopoeia of the People's Republic of China. Beijing: China Medical Science Press; 2020: 129-130
- 3 Henan Provincial Drug Administration. Henan province specifications for processing of TCM (2022 Edition). Zhengzhou: Henan Science and Technology Press; 2022: 138-139
- 4 Xie Y, Zhong LY, Wang Z. et al. Historical evolution and modern research progress of Rehmanniae Radix. Zhongguo Shiyan Fangjixue Zazhi 2022; 24: 273-282
- 5 Li SL, Song JZ, Qiao CF. et al. A novel strategy to rapidly explore potential chemical markers for the discrimination between raw and processed Radix Rehmanniae by UHPLC-TOFMS with multivariate statistical analysis. J Pharm Biomed Anal 2010; 51 (04) 812-823
- 6 Boccard J, Kalousis A, Hilario M. et al. Standard machine learning algorithms applied to UPLC-TOF/MS metabolic fingerprinting for the discovery of wound biomarkers in Arabidopsis thaliana. Chemom Intell Lab Syst 2010; 104: 20-27
- 7 Li Y, Fan J, Cheng X. et al. New Revolution for quality control of TCM in Industry 4.0: focus on artificial intelligence and bioinformatics. TrAC Trends Anal Chem 2024; 181: 118023
- 8 Li H, Zhang S, Zhao Y, He J, Chen X. Identification of raffinose family oligosaccharides in processed Rehmannia glutinosa Libosch using matrix-assisted laser desorption/ionization mass spectrometry image combined with machine learning. Rapid Commun Mass Spectrom 2023; 37 (22) e9635
- 9 Zhou L, Xu JD, Zhou SS. et al. Integrating targeted glycomics and untargeted metabolomics to investigate the processing chemistry of herbal medicines, a case study on Rehmanniae Radix. J Chromatogr A 2016; 1472: 74-87
- 10 Zhou L. Holistic evaluation on quality and efficacy of Rehmanniae Radix Praeparata in “nine cycles of steaming and drying” processing [in Chinese]. [Master's thesis]. Nanjing: Nanjing University of Chinese Medicine; 2017
- 11 Li Y. Processing technology and mechanism of “repeated steaming and air-exposing” of Rehmanniae Radix Praeparata [in Chinese]. [Master's thesis]. Jinan: Shandong University; 2023
- 12 Su C, Chen D. A chromometer that can inspect and measure aperture sizes. CN Patent 222069915U. November, 2024
- 13 Li C, Guo Y, Dong S, Hu Y, Zhang F. Dynamic range adjustment method of the aerospace camera based on histogram distribution. Spacecr Recover Remote Sens 2017; 38: 36-43
- 14 Gao Q, Jiang H, Tang F. et al. Evaluation of the bitter components of bamboo shoots using a metabolomics approach. Food Funct 2019; 10 (01) 90-98
- 15 Morales EF, Escalante HJ. Chapter 6 - A brief introduction to supervised, unsupervised, and reinforcement learning. In: Torres-García AA, Reyes-García CA, Villaseñor-Pineda L, Mendoza-Montoya O. eds. Biosignal Processing and Classification Using Computational Learning and Intelligence. Academic Press; 2022: 111-129
- 16 Dalal N, Sáiz MJ, Caporale AG, Baldini F, Babayan SA, Adamo P. Fishy forensics: FT-NIR and machine learning based authentication of Mediterranean anchovies (Engraulis encrasicolus). J Food Compos Anal 2024; 136: 106847
- 17 Saeed MH, Hama JI. Cardiac disease prediction using AI algorithms with SelectKBest. Med Biol Eng Comput 2023; 61 (12) 3397-3408
- 18 Brownlee J. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery; 2020: 158-163
- 19 Sha Y, Jiang M, Luo G. et al. HerbMet: enhancing metabolomics data analysis for accurate identification of Chinese herbal medicines using deep learning. Phytochem Anal 2025; 36 (01) 261-272
- 20 Shafizadeh F. Acidic hydrolysis of glycosidic bonds. Tappi 1963; 46: 381-383
- 21 Xue S, Fu Y, Sun X, Chen S. Changes in the chemical components of processed rehmanniae radix distillate during different steaming times. Evid Based Complement Alternat Med 2022; 2022: 3382333
- 22 Yang J, Zhang L, Zhang M. et al. Exploration of the dynamic variations of the characteristic constituents and the degradation products of catalpol during the process of Radix Rehmanniae. Molecules 2024; 29 (03) 705
- 23 Liao J, Zhang Y, Zhang W. et al. Different software processing affects the peak picking and metabolic pathway recognition of metabolomics data. J Chromatogr A 2023; 1687: 463700
- 24 Ringnér M. What is principal component analysis? Nat Biotechnol 2008; 26 (03) 303-304
- 25 Lin H, Gui SH, Yu BB, Que XH, Zhu JQ. Analysis of polysaccharide monosaccharides of Radix Rehmanniae by different processing processes and their effects on ovarian granulosa cells. Zhongchengyao 2019; 12: 2958-2963
- 26 Jia H, Zhang WF, Lei JW, Li YY, Yang CJ, Fan KF. UV combined with MIR spectroscopy to discuss the dynamic changes of sugar during the processing of Rehmannia glutinosa. Lishizhen Med Materia Medica Res 2023; 2023: 96-99
- 27 Wang XC, Ma XL, Liu JN. et al. A comparison of feature extraction capabilities of advanced UHPLC-HRMS data analysis tools in plant metabolomics. Anal Chim Acta 2023; 1254: 341127
- 28 Tamrakar S, Huerta B, Chung-Davidson YW, Li W. Plasma metabolomic profiles reveal sex- and maturation-dependent metabolic strategies in sea lamprey (Petromyzon marinus). Metabolomics 2022; 18 (11) 90
- 29 Ponce de Leon-Sanchez ER, Dominguez-Ramirez OA, Herrera-Navarro AM, Rodriguez-Resendiz J, Paredes-Orta C, Mendiola-Santibañez JD. A deep learning approach for predicting multiple sclerosis. Micromachines (Basel) 2023; 14 (04) 749
- 30 Benes E, Bajusz D, Gere A, Fodor M, Rácz A. Comprehensive chemometric classification of snack products based on their near infrared spectra. Lebensm Wiss Technol 2020; 133: 110130
- 31 Wang Q, Zhao YX, Gu J. et al. Establishment of traditional Chinese medicine standards reflecting the quality characteristics of Chinese herbal pieces based on processing. Chung Kuo Yao Hsueh Tsa Chih 2025; 40: 114-120
- 32 Xue R, Zhang Q, Mei X. et al. Research on quality marker based on the processing from Aconiti lateralis radix praeparata to Heishunpian. Phytochem Anal 2024; 35 (06) 1443-1456
Publication History
Received: 17 April 2025
Accepted: 19 September 2025
Article published online: 24 October 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany