Supplemental Data. A Statistical Model for HIV-1 Sequence Classification using the Subtype Analyser (STAR)
R. E. Myers1, C. V. Gale2, A. Harrison3, Y. Takeuchi1, P. Kellam2†
Table 1.
| STAR Identifier | Accession | Reference Strain | Origin | Published Subtype |
|---|---|---|---|---|
| 25AAA1A1_36 | AF069670 | Y | Somalia | A |
| 26AAA1A1_25 | AF004885 | Y | Kenya | A |
| 28AGAGAG_45 | AJ251057 | Y | W/C.Africa | A/AG |
| 27A1A1XX_42 | M62320 | Y | Uganda | A1 |
| 51AAA2A2_12 | AF286237 | Y | Cyprus | A2 |
| 52AAA2A2_13 | AF286238 | Y | D.R. of Congo | A2 |
| 33AEAEAE_41 | U54771 | Y | Thailand | AE |
| 34AEAEXX_41 | U51189 | Y | Thailand | AE |
| 35AEAEAE_9 | U51188 | Y | C.A.R. | AE |
| 36AEAEAE_9 | AF197340 | Y | C.A.R. | AE/E |
| 30AGAGAG_14 | AF063223 | Y | Djibouti | AG |
| 32AGAGAG_45 | AJ251056 | Y | W/C.Africa | AG |
| 29AGAGAG_29 | L39106 | Y | Nigeria | AG |
| 12BBBBBB_0 | NC_001802 | Y | Coffin,J.M. (Ed.); | B |
| 13BBBBBB_44 | U63632 | Y | USA | B |
| 14BBBBBB_44 | U21135 | Y | USA | B |
| 15BBBBBB_22 | M17451/M12508 | Y | Haiti | B |
| 59CCCCCC_6 | AF110967 | Y | Botswana | C |
| 60CCCCCC_23 | AF067155 | Y | India | C |
| 61CCCCCC_15 | U46016 | Y | Ethiopia | C |
| 8CDCDCD_40 | AF289548 | Y | Tanzania | CD |
| 9CDCDCD_40 | AF289550 | Y | Tanzania | CD |
| 11CDCDCD_40 | AF289549 | Y | Tanzania | CD |
| 3DDDDDD_0 | M22639 (U16633, Z2) | Y | not given | D |
| 5DDDDDD_13 | M27323 | Y | D.R. of Congo | D |
| 4DDXXDD_13 | K03454/X04414 | Y | D.R. of Congo | D |
| 10DDXXDD_42 | U88824 | Y | Uganda | D |
| 43FFF1F1_7 | AF005494 | Y | Brazil | F |
| 44FFF1F1_13 | AF077336 | Y | D.R. of Congo | F |
| 45FFF1F1_25 | AF075703 | Y | Kenya | F1 |
| 46FFF1F1_1 | AJ249238 | Y | Africa | F1 |
| 47F2F2F2_10 | AJ249237 | Y | Cameroon | F2 |
| 49F2F2F2_10 | AJ249239 | Y | Cameroon | F3 |
| 62GGGGGG_25 | AF061640 | Y | Kenya | G |
| 63GGGGGG_13 | AF061642 | Y | D.R. of Congo | G |
| 64GGGGGG_13 | AF084936 | Y | D.R. of Congo | G |
| 65HHHHHH_9 | AF005496 | Y | C.A.R. | H |
| 68HHHHHH_4 | AF190127 | Y | Belgium (ori?) | H |
| 69JJJJJJ_13 | AF082394 | Y | D.R. of Congo | J |
| 70JJJJJJ_13 | AF082395 | Y | D.R. of Congo | J |
| 1000KK_0 | AJ249235 | Y | D.R. of Congo | K |
| 1001KK_0 | AJ249239 | Y | Cameroon | K |
| 1NNNNNN_10 | AJ006022 | Y | Cameroon | N |
| 2NNNNNN_10 | AJ271370 | Y | Cameroon | N |
| 536AAAAAA_15 | AF071474 | N | Ethiopia | A |
| 538AAA1A1_42 | AF069671 | N | Uganda | A |
| 540AAA1XX_42 | AF069673 | N | Uganda | A |
| 541AAAAAA_40 | AF069669 | N | Tanzania | A |
| 542AAAAAA_42 | AF069672 | N | Uganda | A |
| 500DDCDAA_42 | AF075701 | N | Uganda | A |
| 569AAA2A2_31 | AF286239 | N | S. Korea | A2/D |
| 619XXA2A2_31 | AF286239 | N | S. Korea | A2/D |
| 552AEAEAE_41 | AB052995 | N | Thailand | AE |
| 553AEAEAE_41 | AF259954 | N | Thailand | AE |
| 556AEAEAE_33 | AY008714 | N | S.China | AE |
| 557AEAEAE_33 | AY008718 | N | S.China | AE |
| 558AEAEAE_41 | AB032740 | N | Thailand | AE |
| 559AEAEAE_41 | AB032741 | N | Thailand | AE |
| 560AEAEAE_41 | AF197339 | N | Thailand | AE |
| 515BBBBAE_41 | AF362994 | N | Thailand | AE/B |
| 555AEAEAE_41 | AF197338 | N | Thailand | AE/E |
| 561AEAEAE_9 | AF197341 | N | C.A.R. | AE/E |
| 544AGAGAG_38 | AF107770 | N | Sweden | AG |
| 545AGAGAG_10 | AF377954 | N | Cameroon | AG |
| 546AGAGAG_19 | AB049811 | N | Ghana | AG |
| 547AGAGAG_10 | AF377955 | N | Cameroon | AG |
| 548AGAGAG_45 | AJ286133 | N | W/C.Africa | AG |
| 549Y3AGAG_19 | AB052867 | N | Ghana | AG |
| 562Y1AGAG_10 | AF377957 | N | Cameroon | AG |
| 550Y4AGGG_4 | AJ276596 | N | Belgium (Trans) | AG/C |
| 564Y5AGCC_25 | AJ276595 | N | Kenya | AG/C |
| 617XXAEAE_41 | AF164485 | N | Thailand | AGE |
| 535Y1AGY1_5 | AJ293865 | N | Benin | AGJ |
| 503BBBBBB_3 | AF042102 | N | Australia | B |
| 504BBBBBB_3 | AF042106 | N | Australia | B |
| 505BBBBBB_3 | AF042104 | N | Australia | B |
| 506BBBBBB_3 | AF042105 | N | Australia | B |
| 507BBBBBB_3 | AF042103 | N | Australia | B |
| 508BBBBBB_3 | AF042100 | N | Australia | B |
| 510BBBBBB_28 | U23487 | N | Manchester, UK | B |
| 512BBBBBB_44 | U26942 | N | USA | B |
| 514BBBBBB_3 | AF042101 | N | Australia | B |
| 516BBBBBB_18 | AJ271445 | N | GB | B |
| 517BBBBXX_39 | AF086817 | N | Taiwan | B |
| 527BBBBBB_44 | U26546 | N | USA | B |
| 528BBBBBB_44 | AF286365 | N | USA | B |
| 529BBBBBB_44 | AF049494 | N | USA | B |
| 530BBBBBB_44 | AF049495 | N | USA | B |
| 531BBBBBB_44 | AF069140 | N | USA | B |
| 571CCCCCC_32 | AF286227 | N | S.Africa | C |
| 572CCCCCC_6 | AF110964 | N | Botswana | C |
| 573CCCCXX_6 | AF110966 | N | Botswana | C |
| 574CCCCCC_6 | AF110962 | N | Botswana | C |
| 575CCCCCC_6 | AF110963 | N | Botswana | C |
| 576CCCCCC_6 | AF110965 | N | Botswana | C |
| 577CCCCCC_6 | AF110972 | N | Botswana | C |
| 579CCCCXX_6 | AF110971 | N | Botswana | C |
| 580CCCCCC_6 | AF110969 | N | Botswana | C |
| 581CCCCCC_46 | AF286224 | N | Zambia | C |
| 582CCCCCC_40 | AF286235 | N | Tanzania | C |
| 583CCCCCC_23 | AF067154 | N | India | C |
| 584CCCCCC_23 | AB023804 | N | India | C |
| 585CCCCCC_23 | AF067158 | N | India | C |
| 586CCCCCC_23 | AF067157 | N | India | C |
| 587CCCCCC_23 | AF067159 | N | India | C |
| 588CCCCAA_23 | AF067156 | N | India | C |
| 590CCCCCC_33 | AF286231 | N | India | C |
| 591CCCCXX_6 | AF110968 | N | Botswana | C |
| 595CCCCCC_6 | AF110961 | N | Botswana | C |
| 597CCCCCC_6 | AF110973 | N | Botswana | C |
| 598CCCCCC_6 | AF110974 | N | Botswana | C |
| 599CCCCCC_6 | AF110975 | N | Botswana | C |
| 600CCCCCC_6 | AF290028 | N | Botswana | C |
| 601CCCCCC_6 | AF290030 | N | Botswana | C |
| 602CCCCCC_6 | AF290027 | N | Botswana | C |
| 603CCCCXX_6 | AF110979 | N | Botswana | C |
| 604CCCCXX_6 | AF110980 | N | Botswana | C |
| 605CCCCXX_6 | AF110981 | N | Botswana | C |
| 606CCCCCC_23 | AF286232 | N | India | C |
| 607CCCCCC_6 | AF110976 | N | Botswana | C |
| 608CCCCCC_6 | AF110978 | N | Botswana | C |
| 609CCCCCC_6 | AF110977 | N | Botswana | C |
| 610CCCCCC_24 | AF286233 | N | Israel | C |
| 620XXCCCC_6 | AF110966 | N | Botswana | C |
| 623XXCCCC_6 | AF290029 | N | Botswana | C |
| 627XXBBF1_7 | AF005495 | N | Brazil | F |
| 565F2F2F2_10 | AF377956 | N | Cameroon | F |
| 563Y5AGAG_19 | AF184155 | N | Ghana | none |
| 509BBBBBB_3 | AF146728 | N | Australia | none |
| 511BBBBBB_44 | AF004394 | N | USA | none |
| 513BBBBBB_44 | AF070521 | N | USA | none |
| 519BBBBBB_37 | AF256209 | N | Spain | none |
| 520BBBBBB_37 | AF256210 | N | Spain | none |
| 521BBBBBB_37 | AF256204 | N | Spain | none |
| 522BBBBBB_37 | AF256206 | N | Spain | none |
| 523BBBBBB_37 | AF256205 | N | Spain | none |
| 524BBBBXX_37 | AF256211 | N | Spain | none |
| 525BBBBBB_37 | AF256208 | N | Spain | none |
| 526BBBBBB_44 | AF075719 | N | USA | none |
| 613OOOOOO_35 | AJ302646 | N | Senegal | O |
| 614OOOOOO_35 | AJ302647 | N | Senegal | O |
Key:
Each HIV-1 genome in the set, containing Gag, Pol and Env proteins, was given an internal accession number (the STAR identifier) of the form 00AABBCC_11:
00 – the number of the genome (numbers 1-72: Los Alamos sequences; numbers >500: sequences derived from GenBank).
AA – two-letter code for the subtype of the Gag sequence, e.g. AA = subtype A, AB = subtype A/B recombinant, XX = unassigned.
BB – two-letter code for the subtype of the Pol sequence.
CC – two-letter code for the subtype of the Env sequence.
_11 – key to the country of origin, which is listed in the table.
C.A.R. – Central African Republic
D.R. of Congo – Democratic Republic of Congo
W/C.Africa – West Coast of Africa
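As an illustration of the identifier scheme described in the key, the following minimal Python sketch splits a STAR identifier into its five parts. It is not part of STAR itself: the names `StarIdentifier` and `parse_star_identifier` are hypothetical, and the pattern assumes that each subtype code is a capital letter followed by a letter or digit (as in A1, F2 or Y3 in the table). Identifiers that do not fit the 00AABBCC_11 form, such as 1000KK_0 and 1001KK_0, are returned as `None` rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class StarIdentifier(NamedTuple):
    number: int        # genome number (the "00" part)
    gag: str           # two-character Gag subtype code (the "AA" part)
    pol: str           # two-character Pol subtype code (the "BB" part)
    env: str           # two-character Env subtype code (the "CC" part)
    country_key: int   # numeric country-of-origin key (the "_11" part)


# Assumption: every subtype code starts with a capital letter and ends in
# a capital letter or a digit, which matches the codes seen in Table 1.
_STAR_ID = re.compile(
    r"^(\d+)([A-Z][A-Z0-9])([A-Z][A-Z0-9])([A-Z][A-Z0-9])_(\d+)$"
)


def parse_star_identifier(identifier: str) -> Optional[StarIdentifier]:
    """Split an identifier into its parts, or return None if it does not fit."""
    match = _STAR_ID.match(identifier.strip())
    if match is None:
        return None
    number, gag, pol, env, country = match.groups()
    return StarIdentifier(int(number), gag, pol, env, int(country))


if __name__ == "__main__":
    for raw in ("25AAA1A1_36", "549Y3AGAG_19", "1000KK_0"):
        print(raw, "->", parse_star_identifier(raw))
```

Requiring each code to begin with a letter is what makes the split unambiguous: the leading digits can only be the genome number, even when a code such as A1 ends in a digit.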
Equations
STAR score function:
1.1

S - score of query sequence against a subtype alignment.
i-j - amino acid positions from i (start) to j (stop) within query sequence.
fx - frequency of query sequence amino acid at the corresponding position within subtype specific alignment.
fy - frequency of query sequence amino acid at the corresponding position within the all-subtype alignment.
PDT – Positive Discriminant Threshold.
NDT – Negative Discriminant Threshold.
1.2
A score (S) is generated for the query sequence against each subtype alignment (n = 11), and each score in the resulting distribution is transformed into a Z-score:

$$Z = \frac{S - \bar{S}}{\sigma_S}$$

S – score.
$\bar{S}$ – average of all scores (n = 11).
$\sigma_S$ – standard deviation of all scores (n = 11).
1.3
The maximal Z-score (Zmax) across the subtype alignments is used to predict the subtype of the query sequence.

$$Z_{\max} = \max(Z_1, \dots, Z_{11})$$
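The Z-score transformation and the Zmax rule described in 1.2 and 1.3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the STAR code: the function names are hypothetical, the raw scores are taken as given (the score function of 1.1 is not reproduced here), and the population standard deviation is an assumption, since the text does not say whether the population or sample form was used. That choice rescales every Z-score by the same factor, so it does not change which subtype is maximal.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def z_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Transform raw scores S (one per subtype alignment) into Z-scores."""
    values = list(scores.values())
    average = mean(values)
    # Assumption: population standard deviation over all subtype scores.
    deviation = pstdev(values)
    if deviation == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return {subtype: (s - average) / deviation for subtype, s in scores.items()}


def predict_subtype(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the subtype with the maximal Z-score, together with Zmax."""
    z = z_scores(scores)
    best = max(z, key=z.get)
    return best, z[best]


if __name__ == "__main__":
    # Made-up raw scores for three hypothetical subtype alignments;
    # STAR itself scores a query against 11 alignments.
    example = {"A": 1.0, "B": 3.0, "C": 2.0}
    subtype, z_max = predict_subtype(example)
    print(subtype, round(z_max, 3))
```

Because Zmax is expressed in standard deviations from the mean of the query's own scores, it also indicates how clearly the predicted subtype stands apart from the alternatives.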

Hidden Markov Model Analysis
Preliminary attempts to create a subtyping tool compared a very simple PSSM model with the HMMer implementation of Hidden Markov Models. HMMer and the PSSM model performed at a very similar level, each misclassifying one sequence. Having established that the PSSM tool could perform as well as the HMM-based tool, the HMM-based tool was not investigated further. This was a practical decision: the code for STAR was written in-house, so its behaviour could be adjusted and the scoring function deconstructed, allowing analysis along the length of query sequences. HMMer as implemented also proved unable to separate the maximal (subtype-predictive) scores from the other subtype scores, which made it difficult to validate the accuracy of subtype predictions (Figure s1).

It should be noted that the use of HMMer could almost certainly have been modified to improve performance. Rather than building a single HMM that modelled the entire HIV-1 Pol (440 aa) sequence alignment, regions of inter- and intra-subtype variation could have been used to create a series of HMMs representing the sequence variation along the length of the region. A series of HMMs would also have allowed recombinant sequences to be analysed, provided that the alignment within each HMM was long enough to give sufficient coverage.

Figure s1. Frequency distribution of Z-scores from the maximal subtype alignments (solid line; n = 141, 141 × 1) and the non-maximal subtype alignments (dashed line; n = 1410, 141 × 10). Data were generated using HMMer to reclassify the 141 sequences used to define the 11 subtype HMMs.