Introduction
For the drug development, farnesyl pyrophosphate synthase (FPPS) enzyme has been established
as a molecular target [1 ]. FPPS (EC 2.5.1.10), a member of E-family of the prenyltransferases, is an important
enzyme in the mevalonate pathway, the elite method of isoprenoid synthesis in animals,
concerned with cholesterol biosynthesis and post-translational modification of signaling
proteins [2 ]
[3 ]
[4 ]. The human farnesyl pyrophosphate synthase (hFPPS) catalyzes the biosynthesis of
the C-15 isoprenoid farnesyl pyrophosphate (FPP) from dimethylallyl pyrophosphate
(DMAPP) through geranyl pyrophosphate (GPP) via one by one abridgment of two isopentenyl
pyrophosphate (IPP) units [5 ]. FPP and GGPP play an imperative function in a surfeit of cellular biological functions
such as over expression of FPPS in fibroblasts also increases farnesylation of Ras
signaling protein and activates the extracellular Ras/ERK signaling cascade [6 ]. Bisphosphonates, the stable analogues of inorganic pyrophosphate, are a class of
drugs that have been used to treat several bone disease such as osteoporosis, post
menopausal osteoporosis in elderly women, osteitis, Paget’s disease of bone, multiple
myeloma, osteogenesis imperfecta and similar diseases [7 ]. Literature survey reveals that N-containing bisphosphonates (NBps) target FPPS
enzyme of the mevalonate pathway [8 ]. Many NBps drugs like alendronate, risedronate and pamidronate have been revealed
to inhibit FPPS enzyme [9 ]
[10 ]
[11 ].
Recently, J. Park et al. [12 ] discovered the biological importance of the allosteric pocket, near the IPP substrate
binding site of the hFPPS active site, and exhibited that FPP product can bind to
this allosteric pocket. This binding sealed the enzyme in an inactive conformation
and therefore, suppling a feedback mechanism for regulating the intra-cellular levels
of isoprenoid biosynthesis in vivo. To understand this mechanism, the thienopyrimidine-based
monophosphonate (ThP-MP) or N-containing monophosphonates (NMps) have been designed
and reported by J. Park et al. [1 ]
[3 ] to get the essential minimal pharmacophore that is required for allosteric inhibition
of hFPPS.
Therefore, many scientists are engaged to find out the liaison between the structure
and the inhibition activities of phosphonate inhibitors [13 ]
[14 ]
[15 ]
[16 ]. The most successful method for constructing this liaison is Quantitative Structure
Activity Relationship (QSAR) method, which is used to predict the drug activities
and information for designing new potential drugs [17 ]
[18 ]
[19 ]
[20 ]
[21 ]. In the present era of drug design, the significance of quantitative structure activity
relationship method is well recognized in view of the fact that QSAR know how to build
the early forecast of activity-related characteristics of drug molecules and can eradicate
the molecules with obnoxious properties [22 ]
[23 ]. Literature survey shows that 3D-QSAR studies have been reported between NBPs and
FPPS [24 ]
[25 ]
[26 ]
[27 ]
[28 ]
[29 ]
[30 ]
[31 ]. However, the majority of QSAR study was performed in non-mammalian species although
Fernández et al. [32 ] and Liu et al. [33 ] reported the 3D-QSAR of N-BPs against hFPPS. It should be more reliable to take
into account thienopyrimidine-based monophosphonate (ThP-MP) and N-BPs to study the
QSAR of phosphonate.
Recent published papers reveal that the simplified molecular input line entry system
(SMILES) is a substitute to classical QSAR methods and it can be used for the prediction
of molecular structures with appropriate end point or activity [34 ]
[35 ]
[36 ]
[37 ]
[38 ]
[39 ]
[40 ]
[41 ]
[42 ]
[43 ]
[44 ]
[45 ]
[46 ]
[47 ]
[48 ]
[49 ]. In all the QSAR models, depending on Monte Carlo optimization method, the pertinent
activity is treated as random event [50 ]
[51 ]
[52 ]
[53 ]. In the light of these facts and in continuation of our work [48 ]
[54 ]
[55 ]
[56 ]
[57 ]
[58 ]
[59 ]
[60 ]
[61 ]
[62 ]
[63 ] on biological important heterocyclic compounds, we herein report Monte Carlo method
based QSAR studies of 73 compounds i. e. thienopyrimidine-based monophosphonate (ThP-MP)
and N-containing bisphosphonates N-BPs active against hFPPS using SMILES and graph
optimal descriptors.
Method
The data set
A series of 73 phosphonate (NBPs and NMPs) derivatives with activity as hfpps inhibitors
was retrieved from literature [3 ]
[32 ]
[33 ]. The structures of NBPs and NMPs were sketched using Marvin sketch [Marvin Sketch
6.0.4 Chem Axon Ltd. http://www.chemaxon.com ] and converted into SMILES by OpenBabel [64 ]. The SMILES notations of NBPs and ThP-MP molecules are presented in Table 1S (supplementary information). The QSAR models were built up for six random splits
(20-30% of compounds were used in calibration set). All six splits were designed according
to the following principles: i) the range of the endpoint (pIC50 ) is evenly distributed for each sub-set ii) the total level of identity between all
the splits is not more than 40% which shows that splits are different (Table 2S , supplementary information).
The identity percentage of the six splits has been confirmed by the reported method
[65 ]. Four sets namely training, invisible-training, calibration, and validation sets
were made from all the six splits. Each set has its sole responsibility. The roll
of the different sets for developing a QSAR model are: (a) The training set (Train)
is designer of the model; (ii) invisible-training (invTrain) set is surveyor of the
model, this set sense and prevent the process of overtraining; (iii) the calibration
set (Calib) is a specialist and have the authority to declare that the model is ready;
(iv) the validation set (Vali) is the reviewer of actual predictive potential of the
model.
Index of ideality of correlation (IIC) used to build up predictive model
The balance of correlations is a technique described in the literature [66 ]
[67 ]
[68 ]
[69 ]. The crux of this technique is building up of a model via the Monte Carlo optimization
of the following target function (TF):
The Rtraining and Rinvisible−training are correlation coefficients between observed and calculated values of an endpoint
for the training and invisible training sets, respectively. The Const is an empirical
constant which is usually fixed [69 ].
In the present manuscript, modified target function (TFm ) for the balance of correlation has been used:
The index of ideality of correlation (IIC) can be an alternative of traditional correlation
coefficient [66 ]
[67 ]. Literature survey shows that a QSPR/QSAR model without IIC for the prediction of
endpoint has some possible defects of error [68 ]
[69 ].
The index of ideality of correlation (IIC) is calculated with the following formula:
The r
calibration is the correlation coefficient value between experimental and calculated values of
an endpoint for the calibration set.
MAE is mean absolute error which can be determined with the following equation:
The quality of prediction (Δ
k ) for one substance from a set can be estimated as the following:
The optimal descriptor
Descriptors procured from either SMILES or molecular graph can be applied to symbolize
the molecular structure. Literature survey reveals that “hybrid” demonstration of
the molecular structure, i. e., by SMILES along with molecular graph, can grant a
better model demonstrated by higher statistical quality than the model which is predicted
by only SMILES or molecular graph[48 ],[70 ]. The hybrid optimal descriptor DCW, adopted for generating QSAR models for the pIC50,
was determined as per the following equation:
Where T is threshold and Nepoch is number of epochs used in monte carlo method optimization. Threshold helps in exclusion
of use of rare molecular features.
SMILES based descriptors were calculated using following equation:
Where, Sk , SSk and SSSk represents local smile attributes while BOND, NOSP, HALO, HARD and PAIR represent
global SMILES attributes and exhibit presence of double, triple or stereochemical
bond, presence of nitrogen, oxygen, sulphur and phosphorus, presence of halogen as
well as pairing between these attributes[71 ]. Optimal descriptors based on graph were calculated using following equation:
Where, EC0k , EC1k , EC2k are Morgan’s extended connectivity indices for kth vertex respectively. VS2k is the sum of vertex degrees which take place at topological distance 2 relative
to kth vertex. NNC is the nearest neighbor code. The C3, C4, C5 and C6 are descriptors for
presence of three-member (C3), four-member (C4), five-member (C5) and six-member rings
(C6) respectively. In this study, hydrogen filled graphs (HFG) have been used.
CORALSEA software was used to prepare the QSAR models. The best model was scrutinized
using the procedure given in literature70 . The most prognostic combination of threshold (T) and number of epochs (Nepoch ) for the six splits was executed from values 1-10 for T and 1-50 for Nepoch . The model was built using balance of correlation technique with index of ideality
of correlation (IIC) of Monte Carlo method. The weight of dr was 0.1. Start step of
the optimization was 0. 5*CW(SA). Precision of the optimization was 0.1*CW(SA) and
weight of IIC was 0.1. Here, CW(SA) is weight of structural attribute (SA) at the
start.
Possessing numerical information on above CW, DCW (T, Nepoch ) can be computed for molecules of training and test set. These data can be used for
calculation of pIC50 according to following Eq. (9):
Validation of QSAR model
The most important objective of any QSAR modeling is to establish a sturdy model competent
to predict the idiosyncrasy of new molecules in an objective, reliable and precise
manner [49 ]
[72 ]. Three methods are cited in the literature for evaluation of sturdiness and reliability
of developed model. These are: (a) internal validation or cross-validation using the
training set compounds (b) external validation using the test set compounds (c) data
randomization or Y-scrambling.
Leave-one-out (LOO) cross validation technique was used to develop models as an internal
validation. Cross-validated Q2 interpret the predictive ability of the model [73 ]
[74 ]. Higher the value of Q2 means better model prediction. The cross-validated Q2 is stated as:
Where, Yobs is observed property of the training set compounds, Ypred is LOO-predicted property of the training set compounds and is mean observed property
of the training set compounds. The predictive ability of model is considered as acceptable
when Q2 is greater than 0.5.
Similar methodology can be applied for external validation. The predictive ability
of a model is determined by calculating Q2
ext which is defined as:
Where, Yobs(test) is the observed property of the test set compounds, Ypred(test) is the predicted property of the test set compounds and Y̅train is mean observed property of the training set compounds. The value Q2
ext for an acceptable model should be greater than 0.5 test [75 ]
.
Y-randomization test was performed to inspect the robustness of the model. A parameter
C R2
p penalizes the model R2 for a small difference between squared mean correlation coefficient (R2 r) of randomized models and squared correlation coefficient (R2 ) of the non-randomized model [75 ]. The parameter C R2
p is defined as:
For an acceptable QSAR model, the value of C R2
p should be greater than 0.5.
Compliance with Applicability domain (AD) and OECD principles
Applicability domain (AD) is another important aspect of a built QSAR model. It gives
the information related to the biological, structure and physiochemical properties
to make the prophecy for new compounds. AD is especially significant because it is
applied for the evaluation of the authenticity of the developed QSAR model. Compounds
from the training set can be used to interpret the AD of the developed QSPR model.
The predictions of a QSAR/QSPR model are more authentic when predicted molecules are
within applicability. For defining AD, procedure reported in literature was applied
[65 ].
To assist the reflection of a QSAR model for regulatory purposes [76 ], it should satisfy the five principal as: 1) a defined endpoint: hFPPS inhibitory activity as definite endpoint, 2) an unambiguous algorithm: Monte Carlo
method as unambiguous algorithm, 3) a defined domain of applicability: percentage
of molecular features with defined role as domain of applicability, 4) appropriate
measures of goodness-of–fit, robustness and predictivity: High values of R2 , Q2 ; C R2
p , R2
m (av) and ∆R2
m metrics, as suitable measures of goodness-of-fit, robustness and predictivity 5)
a mechanistic interpretation, if possible: List of molecular features responsible
for increase and decrease of activity are applied for mechanistic interpretation.
Result and Discussion
Six random splits were generated from the data set presented in Table 1S (supplementary information). All sets were carefully prepared so the ranges are approximately
equivalent for each sub-set. The hybrid optimal descriptor DCW was adopted for generating
QSAR models. The outcomes of the applied methodology for defining AD reveal that maximum
numbers of molecules are within the defined AD. So, all the studied compounds had
typical behavior and all were incorporated in developing QSAR models.
The best QSAR models for six different splits with other statistical parameters of
all six equations are given in [Table 1 ].
Table 1 Statistical parameters of all six QSAR Models
Split
SET
n
R2
CCC
IIC
Q2
s
MAE
F
R2 -Q2
Equation for different Models
Split 1
Training
21
0.9019
0.9484
0.8634
0.8848
0.303
0.238
175
0.0171
pIC
50
=3.6970585 (±0.0365511)+0.0163073 (±0.0002255) * DCW(3,27)
InvTrain
16
0.9053
0.9002
0.2412
0.8835
0.451
0.339
134
0.0218
Calib
19
0.8667
0.9020
0.9309
0.8254
0.398
0.318
110
0.0413
Validation
17
0.7466
NC
0.8479
0.6801
0.486
0.398
44
0.0665
Split 2
Training
20
0.9722
0.9859
0.6574
0.9662
0.173
0.126
628
0.006
pIC
50
=-0.1563000 (±0.0563218)+0.0282081 (±0.0002523) * DCW(2,48)
InvTrain
20
0.9722
0.9805
0.9581
0.9676
0.189
0.143
629
0.0046
Calib
17
0.7386
0.8546
0.8404
0.6860
0.498
0.389
42
0.0526
Validation
16
0.6587
NC
0.6432
0.5872
0.544
0.433
27
0.0715
Split 3
Training
20
0.9827
0.9913
0.8112
0.9787
0.134
0.096
1025
0.004
pIC
50
=-1.5927099 (±0.0653207)+0.0314089 (±0.0002405) * DCW(1,37)
InvTrain
21
0.9828
0.9872
0.5259
0.9792
0.137
0.109
1083
0.0036
Calib
20
0.7924
0.8797
0.8902
0.7531
0.458
0.366
69
0.0393
Validation
12
0.8513
NC
0.6793
0.7840
0.344
0.243
57
0.0673
Split 4
Training
20
0.7380
0.8492
0.7028
0.6888
0.400
0.307
51
0.0492
pIC
50
=-1.3745288 (±0.2456229)+0.0363736 (±0.0010521) * DCW(7,13)
InvTrain
20
0.7361
0.8433
0.5890
0.6926
0.575
0.469
50
0.0435
Calib
15
0.9304
0.9538
0.9614
0.9061
0.267
0.209
174
0.0243
Validation
18
0.7602
NC
0.6703
0.6864
0.486
0.409
51
0.0738
Split 5
Training
19
0.9483
0.9735
0.8766
0.9365
0.195
0.124
312
0.0118
pIC
50
=0.4692126 (±0.0724623)+0.0249464 (±0.0003331) * DCW(2,17)
InvTrain
18
0.9467
0.9341
0.3001
0.9335
0.342
0.279
284
0.0132
Calib
19
0.6500
0.7905
0.7903
0.5775
0.530
0.431
32
0.0725
Validation
17
0.8111
NC
0.6967
0.7591
0.493
0.356
64
0.052
Split 6
Training
17
0.9103
0.9530
0.6677
0.8843
0.224
0.166
152
0.026
pIC
50
=1.8911561 (±0.1196966)+0.0174180 (±0.0003871) * DCW(1,15)
InvTrain
20
0.9070
0.8486
0.2262
0.8871
0.517
0.438
176
0.0199
Calib
17
0.8749
0.9289
0.9348
0.8376
0.428
0.322
105
0.0373
Validation
18
0.7417
NC
0.4430
0.6865
0.450
0.380
46
0.0552
Where, n is number of cases, R2 is the squared correlation coefficient, CCC is concordance correlation coefficient,
IIC is Index of ideality of correlation, Q2 is the leave-one-out (LOO) cross-validation coefficient, s is standard error of estimation,
MAE is mean average error, F is the Fischer ratio and NC means not calculated.
The statistical parameters given in [Table 1 ] clearly indicate that the values of statistical criteria of all six equations are
significant for each subset i. e. training, invisible- training, calibration and validation
sets. The statistical trait of calibration set (R2 =0.9304 and Q2 =0.9061) for the equation of split 4 was distinguished best, therefore the QSAR model
expressed by this was judged to be the preeminent model. The good fitting ability
and good internal predictive power of the described model has been implied by the
R2 value (0.7380) for the training sets and Q2 value of calibration set. The value of R2 is 0.7602 for validation sets which shows the excellent external predictive ability
of the developed QSAR model. The authenticity and robustness of the QSAR models are
justified by less difference between R2 and Q2 values.
Y-randomization process was applied to check the chance correlation, in which Y values
were scrambled in 1000 trials in ten separate runs (Table 3S , supplementary information). The resulting value of C R2
p was found more than 0.5 which authenticate that the developed models are free from
chance correlation [75 ].
The built QSAR model was checked for the external predictive power by applying the
benchmarks proposed by Golbraikh and Tropsha [74 ], Roy and Roy [77 ] and Ojha et al. [78 ]. The values of different proposed benchmarks are displayed in Table 4S (supplementary information) and it can be seen that these values are within specified
ranges. Therefore, it can be said that these QSAR equations have good external predictive
ability. Calculated activity and applicability domain of all compounds along with
different sets are exhibited in supplementary Table 5S (supplementary information). The procedure described in literature was used for the
calculation of applicability domain and it is based on split defect [66 ].
The plots between observed versus calculated activity and residuals versus observed
activity are shown in [Fig. 1 ].
Fig. 1 The plots between observed pIC50 versus calculated pIC50 and residuals versus observed
pIC50.
Mechanistic interpretation of developed QSAR models
The correlation weights (CW) of structural features (SAk) give the information about
the structural attributes which are responsible for the increase and decrease of the
endpoint. If CW(SAk) is positive in all three probes then this feature will enhance
the value of endpoint and if CW(SAk) is negative in all three probes then it will
reduce the value of endpoint. Based on these considerations, some graph based structural
attributes with stable positive values (pIC50 enhancer) for best QSAR model 3 are
EC0-O…1…, EC0-P…4…, EC1-O…4…, NNC-O…101., EC1-C…5…, NNC-P…413, EC2-C…10.., VS2-C…6…
etc. Similarly, some graph based structural attributes with stable negative values
(pIC50 decreasing) for this model are EC0-C…2…, EC1-C…4…, EC0-C…3…, EC0-N…2…, EC1-C…6…,
NNC-C…220., VS2-C…9… etc.
In the same manner, some pIC50 enhancer SMILES based attributes are (…O…(… (branched
oxygen with branching), ++++N---B2==(nitrogen with double bond), ++++N---O===(nitrogen
with oxygen), ++++N---P===(nitrogen with phosphorous), ++++O---B2==(oxygen with double
bond), ++++O---P===(oxygen with phosphorous), ++++P---B2==( phosphorous with double
bond),=…(…….(double bond with branching),=………..(double bond), C………..(methyl group),
O…(…(…(oxygen with double branching), O…(……..(oxygen with branching), O……….. (oxygen),
O…=…(…(oxygen with double bond and branching), P…(…….(phosphorous with branching),
P………..(phosphorous), C4……0…(absence of four membered ring ), C3……0…( absence of three
membered ring), BOND10000000(one double bond), 1…(……., HALO00000000 (absence of halogen),=…O…(…(double
bond with oxygen and branching), P…(…O…(phosphorous with branched oxygen) etc and
some SMILES based promoters of pIC50 decrease are (…(…….(branching with further branching),
(………..(branching), C…(…….(methyl with branching), O…=…….(oxygen with double bond),
1………..(presence of one ring), (…P…(…(branch having phosphorous with branching) etc.
The structure generated using the promoters of increase in end point with their predicted
pIC50 are given in Table 6S (supplementary information). By using the promoters of endpoint, we have developed
some more hFPPS inhibitors. We have predicted the pIC50 by all QSAR models using their respective equation and then average pIC50 was calculated.
On the basis of average calculated pIC50, compound 14 was found most potent by using
model of split 3 but the difference between the maximum and minimum calculated pIC50
from all models is 3.8938. Compound 16 shows nearly constant value of calculated pIC50
and the difference between the maximum and minimum calculated pIC50 is 0.9059. The
graphical representation of all the calculated pIC50 with average calculated pIC50
is shown in [Fig. 2 ]. The result of Table 6S (supplementary information) clearly shows that the bisphosphonates developed as hFPPS
inhibitors are more potent inhibitors than monophosphonates. It has been reported
in literature [1 ]
[3 ]
[7 ]
[8 ]
[9 ]
[10 ]
[11 ]
[12 ] that the interactions of the bisphosphonates in the active site of the HFFPS are
highly conditioned by the protonation state of the functional groups but the method
used in this work does not consider this aspect. It is a limitation of this work.
Despite that the proposed QSAR models predicted the activity of reported bisphosphonates
and monophosphonates accurately.
Fig. 2 Graphical representation of calculated pIC50 from all models for all proposed compound.
Two 3D QSAR model developed by comparative molecular field analysis (CoMFA) method
for the some part of our dataset has been described by Fernández et al. [32 ] and Liu et al.[33 ]. The values for statistical features used by Fernández et al. [32 ] to examine the model were n=20, R2 =0.943, Q2 =0.586, F=63 and SEE=0.11 and by Liu et al. [33 ] to examine the model were n(training) =42, R2
(training) =0.975, n(test) =11 and R2
(test) =0.0.753. In the reported QSAR models, monophosphonates were not used, but have used
to develop a relationship between monophosphonate and bisphosphonates. So, we can
say that the developed QSAR models in the present manuscript have better statistical
quality as compared to the reported model in terms of internal as well as external
prediction criteria. Also, the resent QSAR models are one parameter 2D QSAR models
which are simple in interpretation and simple to apply.
Conclusion
Monte carlo optimization method using CORAL software has been used successfully for
designing a statistically robust QSAR models for human farnesyl pyrophosphate synthase
inhibitors. The best QSAR model described by DCW(7,13) optimal descriptor was obtained
from split 4. In the mechanistic interpretation, the presence of nitrogen with double
bond, a phosphate group and oxygen with branched double bond were found promoter of
endpoint. The discussed QSAR models satisfy the OECD conditions for a good QSAR model.
The outcomes of the applied methodology for defining AD reveal that maximum numbers
of molecules are within the defined AD. Thus, this approach can be used for generation
of new potential human farnesyl pyrophosphate synthase inhibitors.