Supplemental Data. A Statistical Model for HIV-1 Sequence Classification using the Subtype Analyser (STAR)

R. E. Myers1, C. V. Gale2, A. Harrison3, Y. Takeuchi1, P. Kellam2




Table 1.

| STAR Identifier | Accession | Reference Strain | Origin | Published Subtype |
|---|---|---|---|---|
| 25AAA1A1_36 | AF069670 | Y | Somalia | A |
| 26AAA1A1_25 | AF004885 | Y | Kenya | A |
| 28AGAGAG_45 | AJ251057 | Y | W/C.Africa | A/AG |
| 27A1A1XX_42 | M62320 | Y | Uganda | A1 |
| 51AAA2A2_12 | AF286237 | Y | Cyprus | A2 |
| 52AAA2A2_13 | AF286238 | Y | D.R. of Congo | A2 |
| 33AEAEAE_41 | U54771 | Y | Thailand | AE |
| 34AEAEXX_41 | U51189 | Y | Thailand | AE |
| 35AEAEAE_9 | U51188 | Y | C.A.R. | AE |
| 36AEAEAE_9 | AF197340 | Y | C.A.R. | AE/E |
| 30AGAGAG_14 | AF063223 | Y | Djibouti | AG |
| 32AGAGAG_45 | AJ251056 | Y | W/C.Africa | AG |
| 29AGAGAG_29 | L39106 | Y | Nigeria | AG |
| 12BBBBBB_0 | NC_001802 | Y | Coffin,J.M. (Ed.); | B |
| 13BBBBBB_44 | U63632 | Y | USA | B |
| 14BBBBBB_44 | U21135 | Y | USA | B |
| 15BBBBBB_22 | M17451/M12508 | Y | Haiti | B |
| 59CCCCCC_6 | AF110967 | Y | Botswana | C |
| 60CCCCCC_23 | AF067155 | Y | India | C |
| 61CCCCCC_15 | U46016 | Y | Ethiopia | C |
| 8CDCDCD_40 | AF289548 | Y | Tanzania | CD |
| 9CDCDCD_40 | AF289550 | Y | Tanzania | CD |
| 11CDCDCD_40 | AF289549 | Y | Tanzania | CD |
| 3DDDDDD_0 | M22639 (U16633, Z2) | Y | not given | D |
| 5DDDDDD_13 | M27323 | Y | D.R. of Congo | D |
| 4DDXXDD_13 | K03454/X04414 | Y | D.R. of Congo | D |
| 10DDXXDD_42 | U88824 | Y | Uganda | D |
| 43FFF1F1_7 | AF005494 | Y | Brazil | F |
| 44FFF1F1_13 | AF077336 | Y | D.R. of Congo | F |
| 45FFF1F1_25 | AF075703 | Y | Kenya | F1 |
| 46FFF1F1_1 | AJ249238 | Y | Africa | F1 |
| 47F2F2F2_10 | AJ249237 | Y | Cameroon | F2 |
| 49F2F2F2_10 | AJ249239 | Y | Cameroon | F3 |
| 62GGGGGG_25 | AF061640 | Y | Kenya | G |
| 63GGGGGG_13 | AF061642 | Y | D.R. of Congo | G |
| 64GGGGGG_13 | AF084936 | Y | D.R. of Congo | G |
| 65HHHHHH_9 | AF005496 | Y | C.A.R. | H |
| 68HHHHHH_4 | AF190127 | Y | Belgium (ori?) | H |
| 69JJJJJJ_13 | AF082394 | Y | D.R. of Congo | J |
| 70JJJJJJ_13 | AF082395 | Y | D.R. of Congo | J |
| 1000KK_0 | AJ249235 | Y | D.R. of Congo | K |
| 1001KK_0 | AJ249239 | Y | Cameroon | K |
| 1NNNNNN_10 | AJ006022 | Y | Cameroon | N |
| 2NNNNNN_10 | AJ271370 | Y | Cameroon | N |
| 536AAAAAA_15 | AF071474 | N | Ethiopia | A |
| 538AAA1A1_42 | AF069671 | N | Uganda | A |
| 540AAA1XX_42 | AF069673 | N | Uganda | A |
| 541AAAAAA_40 | AF069669 | N | Tanzania | A |
| 542AAAAAA_42 | AF069672 | N | Uganda | A |
| 500DDCDAA_42 | AF075701 | N | Uganda | A |
| 569AAA2A2_31 | AF286239 | N | S. Korea | A2/D |
| 619XXA2A2_31 | AF286239 | N | S. Korea | A2/D |
| 552AEAEAE_41 | AB052995 | N | Thailand | AE |
| 553AEAEAE_41 | AF259954 | N | Thailand | AE |
| 556AEAEAE_33 | AY008714 | N | S.China | AE |
| 557AEAEAE_33 | AY008718 | N | S.China | AE |
| 558AEAEAE_41 | AB032740 | N | Thailand | AE |
| 559AEAEAE_41 | AB032741 | N | Thailand | AE |
| 560AEAEAE_41 | AF197339 | N | Thailand | AE |
| 515BBBBAE_41 | AF362994 | N | Thailand | AE/B |
| 555AEAEAE_41 | AF197338 | N | Thailand | AE/E |
| 561AEAEAE_9 | AF197341 | N | C.A.R. | AE/E |
| 544AGAGAG_38 | AF107770 | N | Sweden | AG |
| 545AGAGAG_10 | AF377954 | N | Cameroon | AG |
| 546AGAGAG_19 | AB049811 | N | Ghana | AG |
| 547AGAGAG_10 | AF377955 | N | Cameroon | AG |
| 548AGAGAG_45 | AJ286133 | N | W/C.Africa | AG |
| 549Y3AGAG_19 | AB052867 | N | Ghana | AG |
| 562Y1AGAG_10 | AF377957 | N | Cameroon | AG |
| 550Y4AGGG_4 | AJ276596 | N | Belgium (Trans) | AG/C |
| 564Y5AGCC_25 | AJ276595 | N | Kenya | AG/C |
| 617XXAEAE_41 | AF164485 | N | Thailand | AGE |
| 535Y1AGY1_5 | AJ293865 | N | Benin | AGJ |
| 503BBBBBB_3 | AF042102 | N | Australia | B |
| 504BBBBBB_3 | AF042106 | N | Australia | B |
| 505BBBBBB_3 | AF042104 | N | Australia | B |
| 506BBBBBB_3 | AF042105 | N | Australia | B |
| 507BBBBBB_3 | AF042103 | N | Australia | B |
| 508BBBBBB_3 | AF042100 | N | Australia | B |
| 510BBBBBB_28 | U23487 | N | Manchester, UK | B |
| 512BBBBBB_44 | U26942 | N | USA | B |
| 514BBBBBB_3 | AF042101 | N | Australia | B |
| 516BBBBBB_18 | AJ271445 | N | GB | B |
| 517BBBBXX_39 | AF086817 | N | Taiwan | B |
| 527BBBBBB_44 | U26546 | N | USA | B |
| 528BBBBBB_44 | AF286365 | N | USA | B |
| 529BBBBBB_44 | AF049494 | N | USA | B |
| 530BBBBBB_44 | AF049495 | N | USA | B |
| 531BBBBBB_44 | AF069140 | N | USA | B |
| 571CCCCCC_32 | AF286227 | N | S.Africa | C |
| 572CCCCCC_6 | AF110964 | N | Botswana | C |
| 573CCCCXX_6 | AF110966 | N | Botswana | C |
| 574CCCCCC_6 | AF110962 | N | Botswana | C |
| 575CCCCCC_6 | AF110963 | N | Botswana | C |
| 576CCCCCC_6 | AF110965 | N | Botswana | C |
| 577CCCCCC_6 | AF110972 | N | Botswana | C |
| 579CCCCXX_6 | AF110971 | N | Botswana | C |
| 580CCCCCC_6 | AF110969 | N | Botswana | C |
| 581CCCCCC_46 | AF286224 | N | Zambia | C |
| 582CCCCCC_40 | AF286235 | N | Tanzania | C |
| 583CCCCCC_23 | AF067154 | N | India | C |
| 584CCCCCC_23 | AB023804 | N | India | C |
| 585CCCCCC_23 | AF067158 | N | India | C |
| 586CCCCCC_23 | AF067157 | N | India | C |
| 587CCCCCC_23 | AF067159 | N | India | C |
| 588CCCCAA_23 | AF067156 | N | India | C |
| 590CCCCCC_33 | AF286231 | N | India | C |
| 591CCCCXX_6 | AF110968 | N | Botswana | C |
| 595CCCCCC_6 | AF110961 | N | Botswana | C |
| 597CCCCCC_6 | AF110973 | N | Botswana | C |
| 598CCCCCC_6 | AF110974 | N | Botswana | C |
| 599CCCCCC_6 | AF110975 | N | Botswana | C |
| 600CCCCCC_6 | AF290028 | N | Botswana | C |
| 601CCCCCC_6 | AF290030 | N | Botswana | C |
| 602CCCCCC_6 | AF290027 | N | Botswana | C |
| 603CCCCXX_6 | AF110979 | N | Botswana | C |
| 604CCCCXX_6 | AF110980 | N | Botswana | C |
| 605CCCCXX_6 | AF110981 | N | Botswana | C |
| 606CCCCCC_23 | AF286232 | N | India | C |
| 607CCCCCC_6 | AF110976 | N | Botswana | C |
| 608CCCCCC_6 | AF110978 | N | Botswana | C |
| 609CCCCCC_6 | AF110977 | N | Botswana | C |
| 610CCCCCC_24 | AF286233 | N | Israel | C |
| 620XXCCCC_6 | AF110966 | N | Botswana | C |
| 623XXCCCC_6 | AF290029 | N | Botswana | C |
| 627XXBBF1_7 | AF005495 | N | Brazil | F |
| 565F2F2F2_10 | AF377956 | N | Cameroon | F |
| 563Y5AGAG_19 | AF184155 | N | Ghana | none |
| 509BBBBBB_3 | AF146728 | N | Australia | none |
| 511BBBBBB_44 | AF004394 | N | USA | none |
| 513BBBBBB_44 | AF070521 | N | USA | none |
| 519BBBBBB_37 | AF256209 | N | Spain | none |
| 520BBBBBB_37 | AF256210 | N | Spain | none |
| 521BBBBBB_37 | AF256204 | N | Spain | none |
| 522BBBBBB_37 | AF256206 | N | Spain | none |
| 523BBBBBB_37 | AF256205 | N | Spain | none |
| 524BBBBXX_37 | AF256211 | N | Spain | none |
| 525BBBBBB_37 | AF256208 | N | Spain | none |
| 526BBBBBB_44 | AF075719 | N | USA | none |
| 613OOOOOO_35 | AJ302646 | N | Senegal | O |
| 614OOOOOO_35 | AJ302647 | N | Senegal | O |



Key:

Each HIV-1 genome containing Gag, Pol and Env proteins was given an internal accession number (the STAR identifier) of the form 00AABBCC_11:

00 – the number of the genome (numbers 1–72, Los Alamos sequences; numbers >500, sequences derived from GenBank).

AA – two-letter code representing the subtype of the Gag sequence, e.g. AA = subtype A, AB = subtype A/B recombinant, XX = unassigned.

BB – two-letter code representing the subtype of the Pol sequence.

CC – two-letter code representing the subtype of the Env sequence.

_11 – key to the country of origin, which is given in the Origin column of the table.


C.A.R. – Central African Republic

D.R. of Congo – Democratic Republic of Congo

W/C.Africa – West Coast of Africa
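
As an illustration of the identifier format described in the key, the following minimal Python sketch splits a STAR identifier into its fields. The regular expression, the `StarId` type and the `parse_star_id` function are assumptions made for this example and are not part of STAR; the pattern assumes that each subtype code is a letter followed by a letter or digit (e.g. AA, A1, XX). Identifiers that do not fit the 00AABBCC_11 form, such as 1000KK_0 in Table 1, are returned as `None` rather than guessed at.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern (not taken from STAR itself): genome number, three
# two-character subtype codes (a letter followed by a letter or digit),
# an underscore, then the numeric country key.
_STAR_ID = re.compile(
    r"^(\d+)([A-Z][A-Z0-9])([A-Z][A-Z0-9])([A-Z][A-Z0-9])_(\d+)$"
)


class StarId(NamedTuple):
    """Fields of an identifier of the form 00AABBCC_11 (hypothetical type)."""
    genome_number: int
    gag: str
    pol: str
    env: str
    country_key: int


def parse_star_id(identifier: str) -> Optional[StarId]:
    """Split a STAR identifier into its fields; return None if it does not fit."""
    match = _STAR_ID.match(identifier)
    if match is None:
        return None
    number, gag, pol, env, country = match.groups()
    return StarId(int(number), gag, pol, env, int(country))


print(parse_star_id("25AAA1A1_36"))   # regular identifier from Table 1
print(parse_star_id("1000KK_0"))      # does not fit the 00AABBCC_11 form
```
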

Equations



STAR score function:

1.1

S – score of the query sequence against a subtype alignment.

i–j – amino acid positions from i (start) to j (stop) within the query sequence.

fx – frequency of the query sequence amino acid at the corresponding position within the subtype-specific alignment.

fy – frequency of the query sequence amino acid at the corresponding position within the all-subtype alignment.

PDT – Positive Discriminant Threshold.

NDT – Negative Discriminant Threshold.


1.2

A score (S) is generated for the query sequence relative to each subtype alignment (n = 11), and each score in the resulting distribution is transformed into a Z-score:

$$Z = \frac{S - \bar{S}}{\sigma_S}$$

S – score.

$\bar{S}$ – average of all scores (n = 11).

$\sigma_S$ – standard deviation of all scores (n = 11).
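
The Z-score transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the STAR code: the function name `z_scores` and the example scores are invented, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample form was used.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert one query's subtype scores into Z-scores.

    Assumption: population standard deviation over the scores supplied
    (11 in the text, one per subtype alignment).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return [(s - mu) / sigma for s in scores]


# Illustrative (made-up) scores for one query against 11 subtype alignments.
example_scores = [41.2, 12.5, 9.8, 15.1, 11.0, 8.4, 13.7, 10.2, 9.1, 12.9, 7.6]
for score, z in zip(example_scores, z_scores(example_scores)):
    print(f"S = {score:5.1f}  Z = {z:6.2f}")
```
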

1.3

The subtype alignment yielding the maximal Z-score (Zmax) is used to predict the subtype of the query sequence.
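
The prediction step amounts to selecting the subtype whose alignment gives the largest Z-score. A minimal sketch follows; the function name `predict_subtype`, the subtype labels and the Z-score values are hypothetical and chosen only for illustration.

```python
def predict_subtype(z_by_subtype):
    """Return (subtype, Zmax) for a mapping of subtype label -> Z-score.

    Ties are resolved in favour of the first label encountered, which is
    an arbitrary choice made for this sketch.
    """
    if not z_by_subtype:
        raise ValueError("no Z-scores supplied")
    best = max(z_by_subtype, key=z_by_subtype.get)
    return best, z_by_subtype[best]


# Hypothetical Z-scores for one query sequence (labels are illustrative).
example = {"A": -0.41, "B": 2.93, "C": -0.12, "D": 0.35, "G": -0.58}
subtype, z_max = predict_subtype(example)
print(f"predicted subtype: {subtype} (Zmax = {z_max:.2f})")
```
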




Hidden Markov Model Analysis


Preliminary attempts to create a subtyping tool compared a very simple PSSM model with the HMMer implementation of Hidden Markov Models (HMMs). HMMer and the PSSM model performed at a very similar level, each misclassifying one sequence. Having established that the PSSM tool could perform as well as the HMM-based tool, the HMM-based tool was not investigated further. This decision was made for practical reasons: the code for STAR was written in-house, so its internals could be adjusted and the scoring function deconstructed, allowing analysis along the length of query sequences. HMMer as implemented was also unable to separate the maximal (subtype-predictive) scores from the other subtype scores, which made it difficult to validate the accuracy of subtype predictions (Figure s1). The use of HMMer could almost certainly have been modified to improve performance. Rather than building a single HMM to model the entire HIV-1 Pol (440 aa) sequence alignment, regions of inter- and intra-subtype variation could have been used to create a series of HMMs representing the sequence variation along the length of the region. A series of HMMs would also have allowed recombinant sequences to be analysed, provided that the length of the alignment within each HMM gave sufficient coverage.














Figure s1. Frequency distribution of Z-scores from the maximal subtype alignments (solid line; n = 141, 141 x 1) and the non-maximal subtype alignments (dashed line; n = 1410, 141 x 10). Data were generated using HMMer to reclassify the 141 sequences used to define the 11 subtype HMMs.
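
The two distributions compared in Figure s1 can be assembled from a table of Z-scores with one row per sequence and one column per subtype model, which gives one maximal and ten non-maximal values per sequence. The sketch below shows one way to do this; the function names and the example values are assumptions for illustration, and the "gap" measure is a simple stand-in for judging separation, not a statistic used in the paper.

```python
def split_by_maximal(z_rows):
    """Split per-sequence Z-scores into maximal and non-maximal values.

    z_rows: one list of Z-scores per sequence (11 per row in the text).
    Returns (maximal, non_maximal); with 141 rows of 11 scores this
    yields 141 and 1410 values respectively.
    """
    maximal, non_maximal = [], []
    for row in z_rows:
        top = max(range(len(row)), key=row.__getitem__)
        maximal.append(row[top])
        non_maximal.extend(z for k, z in enumerate(row) if k != top)
    return maximal, non_maximal


def separation_gap(maximal, non_maximal):
    """Smallest maximal score minus largest non-maximal score.

    A positive gap means the two distributions do not overlap.
    """
    return min(maximal) - max(non_maximal)


# Made-up Z-scores for three sequences against 11 subtype models.
rows = [
    [2.9, 0.1, -0.3, -0.2, -0.4, -0.5, -0.1, -0.6, -0.3, -0.4, -0.2],
    [-0.2, 3.0, -0.1, -0.4, -0.3, -0.5, -0.2, -0.4, -0.3, -0.3, -0.3],
    [0.9, 1.1, -0.6, -0.4, -0.2, 0.8, -0.5, -0.3, -0.4, -0.2, -0.2],
]
top, rest = split_by_maximal(rows)
print(len(top), len(rest), round(separation_gap(top, rest), 2))
```
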