lnp_ml/ARCHITECTURE.md
2026-01-18 21:50:21 +08:00

## Model Architecture
1. Input (8 tokens)
```
# chem token([B, 600])
SMILES(string) + mpnn_encoder -> chem token
# morgan token([B, 1024]), maccs token([B, 167]), rdkit token([B, 210])
SMILES(string) + rdkit_encoder -> morgan token, maccs token, rdkit token
# comp token([B, 5])
Cationic_Lipid_to_mRNA_weight_ratio(float)
Cationic_Lipid_Mol_Ratio(float)
Phospholipid_Mol_Ratio(float)
Cholesterol_Mol_Ratio(float)
PEG_Lipid_Mol_Ratio(float)
# phys token([B, 12])
Purity_Pure(one-hot for Purity)
Purity_Crude(one-hot for Purity)
Mix_type_Microfluidic(one-hot for Mix_type)
Mix_type_Microfluidic(one-hot for Mix_type)
Cargo_type_mRNA(one-hot for Cargo_type)
Cargo_type_pDNA(one-hot for Cargo_type)
Cargo_type_siRNA(one-hot for Cargo_type)
Target_or_delivered_gene_FFL(one-hot for Target_or_delivered_gene)
Target_or_delivered_gene_Peptide_barcode(one-hot for Target_or_delivered_gene)
Target_or_delivered_gene_hEPO(one-hot for Target_or_delivered_gene)
Target_or_delivered_gene_FVII(one-hot for Target_or_delivered_gene)
Target_or_delivered_gene_GFP(one-hot for Target_or_delivered_gene)
# help token([B, 4])
Helper_lipid_ID_DOPE(one-hot for Helper_lipid_ID)
Helper_lipid_ID_DOTAP(one-hot for Helper_lipid_ID)
Helper_lipid_ID_DSPC(one-hot for Helper_lipid_ID)
Helper_lipid_ID_MDOA(one-hot for Helper_lipid_ID)
# exp token([B, 32])
Model_type_A549(one-hot for Model_type)
Model_type_BDMC(one-hot for Model_type)
Model_type_BMDM(one-hot for Model_type)
Model_type_HBEC_ALI(one-hot for Model_type)
Model_type_HEK293T(one-hot for Model_type)
Model_type_HeLa(one-hot for Model_type)
Model_type_IGROV1(one-hot for Model_type)
Model_type_Mouse(one-hot for Model_type)
Model_type_RAW264p7(one-hot for Model_type)
Delivery_target_dendritic_cell(one-hot for Delivery_target)
Delivery_target_generic_cell(one-hot for Delivery_target)
Delivery_target_liver(one-hot for Delivery_target)
Delivery_target_lung(one-hot for Delivery_target)
Delivery_target_lung_epithelium(one-hot for Delivery_target)
Delivery_target_macrophage(one-hot for Delivery_target)
Delivery_target_muscle(one-hot for Delivery_target)
Delivery_target_spleen(one-hot for Delivery_target)
Delivery_target_body(one-hot for Delivery_target)
Route_of_administration_in_vitro(one-hot for Route_of_administration)
Route_of_administration_intravenous(one-hot for Route_of_administration)
Route_of_administration_intramuscular(one-hot for Route_of_administration)
Route_of_administration_intratracheal(one-hot for Route_of_administration)
Sample_organization_type_individual(one-hot for Sample_organization_type)
Sample_organization_type_barcoded(one-hot for Sample_organization_type)
Value_name_log_luminescence(one-hot for Value_name)
Value_name_luminescence(one-hot for Value_name)
Value_name_FFL_silencing(one-hot for Value_name)
Value_name_Peptide_abundance(one-hot for Value_name)
Value_name_hEPO(one-hot for Value_name)
Value_name_FVII_silencing(one-hot for Value_name)
Value_name_GFP_delivery(one-hot for Value_name)
Value_name_Discretized_luminescence(one-hot for Value_name)
```
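As a minimal sketch of how the tabular tokens above could be assembled, the snippet below builds the help token (4 one-hot slots) and the comp token (5 floats) for a single sample. The helper names `one_hot`, `build_help_token`, and `build_comp_token`, and the values in `record`, are hypothetical and used only for illustration. Only the column names and token widths come from the spec above.

```python
# Category and field lists copied from the help and comp token specs above.
HELPER_LIPIDS = ["DOPE", "DOTAP", "DSPC", "MDOA"]
COMP_FIELDS = [
    "Cationic_Lipid_to_mRNA_weight_ratio",
    "Cationic_Lipid_Mol_Ratio",
    "Phospholipid_Mol_Ratio",
    "Cholesterol_Mol_Ratio",
    "PEG_Lipid_Mol_Ratio",
]


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]


def build_help_token(record):
    """Hypothetical helper: one-hot encode Helper_lipid_ID into 4 slots."""
    return one_hot(record["Helper_lipid_ID"], HELPER_LIPIDS)


def build_comp_token(record):
    """Hypothetical helper: collect the 5 composition floats in spec order."""
    return [float(record[name]) for name in COMP_FIELDS]


# Made-up sample values, for illustration only.
record = {
    "Helper_lipid_ID": "DSPC",
    "Cationic_Lipid_to_mRNA_weight_ratio": 10.0,
    "Cationic_Lipid_Mol_Ratio": 50.0,
    "Phospholipid_Mol_Ratio": 10.0,
    "Cholesterol_Mol_Ratio": 38.5,
    "PEG_Lipid_Mol_Ratio": 1.5,
}

help_token = build_help_token(record)  # length 4, one sample of [B, 4]
comp_token = build_comp_token(record)  # length 5, one sample of [B, 5]
```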
2. Token projector
3. Channel split: into A [chem, morgan, maccs, rdkit] and B [comp, phys, help, exp]
4. Bi-directional cross attention
5. Token re-compose: back to [chem, morgan, maccs, rdkit, comp, phys, help, exp]
6. Fusion layer (depends on user choice)
7. Heads
```
# regression head
size(float, training data already log-transformed)
# classification head
toxic(boolean, 0/1)
# regression head
quantified_delivery(float, training data already z-scored)
# classification head
PDI_0_0to0_2(one-hot classification for PDI)
PDI_0_2to0_3(one-hot classification for PDI)
PDI_0_3to0_4(one-hot classification for PDI)
PDI_0_4to0_5(one-hot classification for PDI)
# classification head
Encapsulation_Efficiency_EE<50(one-hot classification for Encapsulation_Efficiency)
Encapsulation_Efficiency_50<=EE<80(one-hot classification for Encapsulation_Efficiency)
Encapsulation_Efficiency_80<EE<=100(one-hot classification for Encapsulation_Efficiency)
# distribution head
Biodistribution_lymph_nodes(float, sum of Biodistribution is 1)
Biodistribution_heart(float, sum of Biodistribution is 1)
Biodistribution_liver(float, sum of Biodistribution is 1)
Biodistribution_spleen(float, sum of Biodistribution is 1)
Biodistribution_lung(float, sum of Biodistribution is 1)
Biodistribution_kidney(float, sum of Biodistribution is 1)
Biodistribution_muscle(float, sum of Biodistribution is 1)
```
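To make the tensor shapes in steps 2 to 7 concrete, here is a rough NumPy walk-through of one forward pass. It is a sketch under several assumptions, not the project's implementation: the shared width `D`, the random weights, the single-head attention without learned query/key/value projections, the residual connection, and mean pooling as the fusion choice are all illustrative. Only the token names, token widths, channel grouping, and the 7 biodistribution outputs come from this document.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 2, 16  # batch size and shared token width: illustrative assumptions

# Token widths taken from the input spec (step 1).
TOKEN_DIMS = {"chem": 600, "morgan": 1024, "maccs": 167, "rdkit": 210,
              "comp": 5, "phys": 12, "help": 4, "exp": 32}
ORDER = list(TOKEN_DIMS)  # dict order is preserved: A tokens first, then B


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(q_tokens, kv_tokens):
    """Single-head scaled dot-product attention with a residual connection.

    Queries come from one channel, keys and values from the other.
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    scale = np.sqrt(q_tokens.shape[-1])
    scores = q_tokens @ kv_tokens.transpose(0, 2, 1) / scale  # [B, Tq, Tkv]
    return q_tokens + softmax(scores) @ kv_tokens             # [B, Tq, D]


# Random stand-ins for the encoder outputs of step 1.
inputs = {n: rng.normal(size=(B, d)) for n, d in TOKEN_DIMS.items()}

# Step 2: one linear projection per token, each mapped to [B, D].
proj = {n: rng.normal(size=(d, D)) / np.sqrt(d) for n, d in TOKEN_DIMS.items()}
tokens = {n: inputs[n] @ proj[n] for n in ORDER}

# Step 3: channel split into A (structure tokens) and B (tabular tokens).
channel_a = np.stack([tokens[n] for n in ORDER[:4]], axis=1)  # [B, 4, D]
channel_b = np.stack([tokens[n] for n in ORDER[4:]], axis=1)  # [B, 4, D]

# Step 4: bi-directional cross attention (A attends to B, B attends to A).
a_out = cross_attend(channel_a, channel_b)
b_out = cross_attend(channel_b, channel_a)

# Step 5: re-compose the 8 tokens in their original order.
recomposed = np.concatenate([a_out, b_out], axis=1)  # [B, 8, D]

# Step 6: fusion; mean pooling is just one possible choice.
fused = recomposed.mean(axis=1)  # [B, D]

# Step 7: distribution head; softmax makes the 7 organ fractions sum to 1.
w_biodist = rng.normal(size=(D, 7))
biodistribution = softmax(fused @ w_biodist)  # [B, 7]
```

The softmax in the last step is one straightforward way to satisfy the "sum of Biodistribution is 1" constraint. The regression and classification heads would be separate linear layers on the same `fused` vector.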