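This text is a translation of reference [1], which explains how to fine-tune the Falcon-7B large language model on a mental health conversation dataset using QLoRA. The project's GitHub repository is https://github.com/iamarunbrahma/finetuned-qlora-falcon7b-medical.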

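Fine-tuning a pre-trained LLM with domain adaptation techniques can improve its performance on domain-specific tasks. Full fine-tuning, however, can be expensive and can trigger CUDA out-of-memory errors. It can also cause catastrophic forgetting, because many of the weights that store the model's existing knowledge get changed. As a result, fine-tuning a pre-trained LLM with billions of parameters on consumer hardware has not been easy until now.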

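Core Rationale
Mental health should be a top priority for every individual, just as important as physical health. In our society, discussions around depression and mental disorders have been stigmatized to the point that people avoid talking about anxiety and depression and avoid seeing a psychiatrist.
Chatbots offer a readily available and accessible platform for individuals seeking support. They can be reached anytime and anywhere, providing immediate help to those in need. Chatbots can give empathetic, non-judgmental responses and offer emotional support. They cannot fully replace human interaction, but they can be a useful supplement in difficult moments. Even so, few anonymous chat applications provide reliable information and psychoeducation about mental health conditions, symptoms, coping strategies, and available treatment options.
The main goal is therefore to build a mental health chatbot by fine-tuning the open-source Falcon-7B LLM with QLoRA on a curated conversational dataset. Falcon-7B is released under the Apache 2.0 license, so it can be used for commercial purposes.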

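What is LoRA?
Let me introduce LoRA [2] (Low-Rank Adaptation of Large Language Models, by Edward Hu et al.). LoRA is a parameter-efficient fine-tuning (PEFT) method for LLMs. With PEFT, we can fine-tune an LLM to strong performance while updating only a small number of parameters. Another advantage of PEFT is that a large model can be fine-tuned with less data.
LoRA is an implicit low-rank transformation technique for large weight matrices. It does not decompose the weight matrix directly; instead, it keeps the pre-trained weights frozen and learns a pair of low-rank factor matrices through backpropagation, whose product is added to the frozen weights.
Although the weights of a pre-trained model have full rank on the pre-training task, the model has a low intrinsic dimension when it is adapted to a new domain-specific task. A low intrinsic dimension means the data can be approximated well in a lower-dimensional space while preserving most of the essential information or structure.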
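
To make the low-rank update concrete, here is a minimal pure-Python sketch. It is not the PEFT implementation. It assumes the usual LoRA formulation h = W x + (alpha / r) * B (A x), with W frozen and only A and B trainable; the tiny matrix sizes and the helper names matvec and lora_forward are made up for illustration.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)               # W is frozen during fine-tuning
    update = matvec(B, matvec(A, x))  # only A (r x d_in) and B (d_out x r) are trained
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B_init = [[0.0] * r for _ in range(d_out)]     # zero init: the adapter starts as a no-op
B_trained = [[0.5] * r for _ in range(d_out)]  # stand-in for values learned by backpropagation
x = [random.gauss(0, 1) for _ in range(d_in)]

base_out = matvec(W, x)
out_at_init = lora_forward(W, A, B_init, x, alpha, r)
out_after = lora_forward(W, A, B_trained, x, alpha, r)
print(out_at_init == base_out)  # the untrained adapter does not change the output
```

Because B starts at zero, the adapter initially leaves the frozen model's output unchanged; training then moves only A and B.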

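What is QLoRA?
Next, let's look at QLoRA [3] (Quantized Low-Rank Adaptation of LLMs, by Tim Dettmers et al.). QLoRA lowers the average memory footprint by storing the frozen base model in 4-bit precision, computing in mixed precision, and applying double quantization. It uses a storage data type (4-bit NormalFloat, NF4) and a compute data type (16-bit BrainFloat, BF16).
In QLoRA, the pre-trained weight matrices are stored in NF4, while the trainable LoRA weight matrices are stored in BFloat16. During the forward and backward passes, the pre-trained weights are dequantized to BF16, but weight gradients are computed only for the LoRA parameters. QLoRA backpropagates gradients through the frozen 4-bit quantized pre-trained model into the low-rank adapters. It also uses Nvidia unified memory (through paged optimizers) so that there is enough memory during weight updates to avoid out-of-memory errors.
QLoRA also introduces double quantization, which reduces the average memory footprint by quantizing the quantization constants themselves. With 4-bit quantization of the pre-trained model, the model weights are compressed from 16- or 32-bit floating point to the 4-bit NF4 format, while activations stay in the compute data type.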
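
The saving from double quantization can be checked with a little arithmetic. The sketch below assumes the settings described in the QLoRA paper (one 32-bit constant per block of 64 weights, re-quantized to 8 bits with one 32-bit second-level constant per 256 first-level constants) and a nominal 7 billion parameters; the function names are illustrative.

```python
def constant_overhead_bits(block_size, constant_bits=32):
    """Extra bits per weight when each block stores one quantization constant."""
    return constant_bits / block_size

def double_quant_overhead_bits(block_size, second_block_size,
                               first_bits=8, second_bits=32):
    """Extra bits per weight when the constants themselves are quantized."""
    first_level = first_bits / block_size
    second_level = second_bits / (block_size * second_block_size)
    return first_level + second_level

plain = constant_overhead_bits(64)             # 32 / 64
double = double_quant_overhead_bits(64, 256)   # 8 / 64 + 32 / (64 * 256)
saved_bits_per_weight = plain - double

nominal_params = 7_000_000_000  # assumption: round figure for a 7B model
saved_gigabytes = saved_bits_per_weight * nominal_params / 8 / 1e9

print(f"overhead without double quantization: {plain:.3f} bits/weight")
print(f"overhead with double quantization:    {double:.3f} bits/weight")
print(f"approximate saving for a 7B model:    {saved_gigabytes:.2f} GB")
```

The saving per weight is small, but it adds up across billions of parameters.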

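Steps of 4-bit NormalFloat Quantization
4-bit NormalFloat quantization is a mathematically straightforward process. First, the weights are normalized: each block of weights is rescaled by its absolute maximum so that the values fall in the range [-1, 1]. The normalized weights are then quantized to 4 bits, which maps each high-precision weight to one of 16 lower-precision values. In NF4, the quantization levels are not evenly spaced; they are placed at quantiles of a normal distribution, which matches the roughly zero-centered, normally distributed shape of pre-trained weights.
During the forward and backward passes, the quantized weights are dequantized back to the compute precision by mapping each 4-bit value back to its original range. The dequantized weights are used for computation, but they remain stored in memory in 4-bit form.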
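
The quantize and dequantize steps above can be sketched in pure Python. This is a simplified illustration, not the bitsandbytes NF4 codebook: it places 16 levels at evenly spaced normal quantiles and rescales them to about [-1, 1], whereas the real NF4 table is constructed asymmetrically so that it contains an exact zero. All function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Place 2**bits levels at normal quantiles, rescaled to roughly [-1, 1]."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(weights, levels):
    """Rescale one block by its absolute maximum, then pick the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax  # 4-bit codes plus one constant per block

def dequantize_block(codes, absmax, levels):
    """Map each code back to its level and undo the absmax rescaling."""
    return [levels[c] * absmax for c in codes]

levels = normal_quantile_levels()
block = [0.12, -0.40, 0.03, 0.85, -0.07, 0.0, 0.33, -0.91]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.2f} -> code {code:2d} -> {approx:+.3f}")
```

The levels are densest near zero, where most normalized weights lie, so small weights are reconstructed more precisely than large ones.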

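Introduction
In this blog post, I will walk through fine-tuning the Falcon-7B model with QLoRA using bitsandbytes and PEFT from HuggingFace. I use a custom mental health conversational dataset that I curated from various blogs, health websites such as WebMD and HealthLine, mental health FAQs, and other trusted health resources. The dataset contains 172 rows of high-quality conversations between patients and healthcare providers. All names and PII have been anonymized, and the text has been pre-processed to remove unwanted characters.
I ran the entire fine-tuning on an Nvidia A100 GPU with Google Colab Pro, and it took less than an hour. The free-tier Nvidia T4 GPU on Colab also works; in that case, make sure max_steps for fine-tuning is below 200.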

Installing Libraries for QLoRA

!pip install trl transformers accelerate git+https://github.com/huggingface/peft.git -Uqqq
!pip install datasets bitsandbytes einops wandb -Uqqq

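I installed bitsandbytes (for quantizing the LLM), PEFT (for fine-tuning the LoRA parameters), datasets (for loading HF datasets), wandb (for monitoring fine-tuning metrics), and trl (for training transformer LLMs with a supervised fine-tuning step).
I also loaded the custom mental health dataset (heliosbrahma/mental_health_chatbot_dataset) from HuggingFace datasets. It contains a single column named "text" that holds the conversations between patients and doctors.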

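Quantization of the Falcon-7B Model
First, I loaded a sharded model rather than a single large checkpoint. The advantage of a sharded model is that, combined with accelerate, its pieces can be placed on different memory devices, sometimes CPU and sometimes GPU, which makes it possible to fine-tune a large model with a smaller amount of memory. I used the sharded model ybelkada/falcon-7b-sharded-bf16 [4].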

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"  # sharded falcon-7b model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # quantize the pre-trained weights to the 4-bit NF format
    bnb_4bit_use_double_quant=True,         # use double quantization as proposed in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for computation
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # use the bitsandbytes config
    device_map="auto",               # let HF Accelerate decide which device each layer goes on
    trust_remote_code=True,          # allow the custom model code that falcon-7b ships with
)

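Here, load_in_4bit loads the model in 4-bit precision, and bnb_4bit_use_double_quant enables double quantization as proposed in QLoRA. bnb_4bit_compute_dtype sets the 16-bit format that the base model is dequantized to during computation.
When loading the pre-trained weights, I added device_map="auto" so that Hugging Face Accelerate automatically decides which device each layer of the model is placed on. In addition, trust_remote_code=True allows the custom model code defined on the Hub to be loaded.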

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse eos_token as pad_token

Here I load the tokenizer of the pre-trained model to tokenize the dataset. I set pad_token to the same token as eos_token so that sequences can be padded and sent to training in batches.
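
As a minimal sketch of what padding does, the pure-Python example below pads a batch of token-id lists to a common length and builds the matching attention masks. It only illustrates the idea and is not the tokenizer's own code; right-side padding, the token ids, and the id 11 used for the end-of-sequence token are assumptions made for the example.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

EOS_ID = 11  # hypothetical end-of-sequence id, reused as the padding id
ids, mask = pad_batch([[5, 8, 2], [7, 3, 9, 4, 6]], pad_id=EOS_ID)
print(ids)
print(mask)
```

Since the padding id is the same as the end-of-sequence id, the attention mask is what tells the model which positions are padding.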

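Configuration Settings for the PEFT Model and Getting the PEFT Model

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_alpha = 32      # scaling factor for the LoRA weight matrices
lora_dropout = 0.05  # dropout probability of the LoRA layers
lora_rank = 32       # dimension of the low-rank matrices

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",  # 'none' trains only the weight parameters, not the biases
    task_type="CAUSAL_LM",
    target_modules=[  # names of the falcon-7b modules that get LoRA adapters
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

peft_model = get_peft_model(model, peft_config)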

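Since I am doing text generation, task_type is set to CAUSAL_LM. lora_alpha is the scaling factor for the weight matrices; a larger value gives the LoRA activations more weight. I set the LoRA rank to 32. Empirically, in this setup a rank of 32 gave better results than rank 64 or rank 16. To cover all linear layers in the Transformer block for maximum performance, I added the "dense", "dense_h_to_4h", and "dense_4h_to_h" layers as target modules in addition to the fused "query_key_value" projection. lora_dropout is the dropout probability of the LoRA layers. Here I set bias to "none", but it can also be set to "lora_only" to train only the bias parameters of the LoRA-adapted layers.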
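
To get a feel for how many parameters this configuration trains, the sketch below counts the LoRA parameters for the four target modules. The layer shapes are assumptions and are not stated in the article: a hidden size of 4544, a fused query_key_value output of 4672, an MLP width of 4 x 4544, and 32 decoder layers. Treat the result as a rough estimate, not a figure reported by the author.

```python
def lora_param_count(in_features, out_features, rank):
    """A has shape (rank x in_features) and B has shape (out_features x rank)."""
    return rank * (in_features + out_features)

# ASSUMED Falcon-7B shapes (not given in the article)
HIDDEN, QKV_OUT, MLP, LAYERS = 4544, 4672, 4 * 4544, 32

target_shapes = {
    "query_key_value": (HIDDEN, QKV_OUT),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, MLP),
    "dense_4h_to_h": (MLP, HIDDEN),
}

lora_rank, lora_alpha = 32, 32  # values from the LoraConfig above
per_layer = sum(
    lora_param_count(n_in, n_out, lora_rank)
    for n_in, n_out in target_shapes.values()
)
total_trainable = per_layer * LAYERS
scaling = lora_alpha / lora_rank  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_trainable:,}")
print(f"update scaling alpha / r:  {scaling}")
```

Under these assumed shapes the adapters amount to tens of millions of parameters, a small fraction of the roughly 7 billion frozen base weights.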

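Configuration Settings for TrainingArguments and Trainer

output_dir = "./falcon-7b-sharded-bf16-finetuned-mental-health-conversational"
per_device_train_batch_size = 16  # halve the batch size if you run out of memory
gradient_accumulation_steps = 4   # double this whenever you halve the batch size
optim = "paged_adamw_32bit"       # paged optimizer for better memory management
save_strategy = "steps"           # checkpoint saving strategy during training
save_steps = 10                   # number of update steps between two checkpoint saves
logging_steps = 10                # number of update steps between two logs when logging_strategy="steps"
learning_rate = 2e-4              # learning rate for the AdamW optimizer
max_grad_norm = 0.3               # maximum gradient norm (for gradient clipping)
max_steps = 320                   # train for 320 steps
warmup_ratio = 0.03               # fraction of the training steps used for linear warmup from 0 to learning_rate
lr_scheduler_type = "cosine"      # learning rate scheduler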

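from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the custom dataset described earlier; it has a single "text" column
data = load_dataset("heliosbrahma/mental_health_chatbot_dataset")

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_strategy=save_strategy,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    bf16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
)

# Argument names follow the TRL release used for this post; newer releases
# may expect some of them (such as max_seq_length) in a separate config object.
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
)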

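Here I use SFTTrainer from the TRL library for the supervised fine-tuning (instruct-tuning) step. I keep the maximum sequence length at 1024; increasing it may slow down training. On a free-tier GPU, you can set it to 512 or 256 as needed.
I also specify the training arguments: batch size, gradient accumulation steps, the learning rate scheduler type (cosine here; you can also try "constant"), the maximum number of steps (with a Colab Pro subscription you can raise it to 500), and the output directory where results are saved.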
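
To see what warmup_ratio and the cosine scheduler do to the learning rate, here is a small pure-Python sketch. It assumes linear warmup for ceil(max_steps * warmup_ratio) steps followed by a half-cosine decay to zero, which is my understanding of the scheduler's behavior; the function lr_at_step is illustrative and is not the library's code.

```python
import math

def lr_at_step(step, max_steps=320, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to peak_lr, then cosine decay toward 0."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

warmup_steps = math.ceil(320 * 0.03)
for step in (0, 5, warmup_steps, 165, 320):
    print(f"step {step:3d}: lr = {lr_at_step(step):.2e}")
```

With the values used in this post, the warmup covers only the first few steps, so nearly all of training happens on the decaying part of the curve.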
Note: if you hit a CUDA out-of-memory error, try halving the batch size and doubling the gradient accumulation steps.
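
The reason this trade works is that the effective batch size per optimizer update stays the same. The sketch below checks that with the numbers from this post; the single-GPU setting and the assumption that each of the 172 rows is one training example are mine.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

original = effective_batch_size(16, 4)  # settings used in this post
reduced = effective_batch_size(8, 8)    # after halving the batch and doubling accumulation

dataset_rows, max_steps = 172, 320
approx_epochs = max_steps * original / dataset_rows  # passes over the data, roughly

print(f"effective batch size: {original} (original), {reduced} (reduced)")
print(f"approximate passes over the dataset: {approx_epochs:.0f}")
```

Peak memory drops because fewer examples are held on the GPU at once, at the cost of more forward and backward passes per update.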

peft_model.config.use_cache = False
trainer.train()

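Before training starts, make sure use_cache is set to False; the key/value cache only helps at generation time and does not work together with gradient checkpointing. Finally, start instruct-tuning with the PEFT model. For me, 320 training steps on an Nvidia A100 GPU took less than an hour; depending on the number of steps and the GPU, training may take longer. The training loss logs are available here [5]. The model is pushed to the HuggingFace Hub: heliosbrahma/falcon-7b-sharded-bf16-finetuned-mental-health-conversational [6].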

Inference Pipeline for the PEFT Model

from transformers import GenerationConfig

# Assumes `model` and `tokenizer` hold the original sharded model, and that
# `peft_model` and `peft_tokenizer` hold the fine-tuned model, loaded separately.
def generate_answer(query):
  system_prompt = """Answer the following question truthfully.
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'."""

  user_prompt = f""": {query}
  : """

  final_prompt = system_prompt + "\n" + user_prompt

  device = "cuda:0"
  dashline = "-" * 49

  def build_generation_config(tok):
    # Shared decoding settings for both models
    return GenerationConfig(
        max_new_tokens=256,
        pad_token_id=tok.eos_token_id,
        eos_token_id=tok.eos_token_id,
        do_sample=True,  # not in the original snippet; temperature and top_p only apply when sampling
        temperature=0.4,
        top_p=0.6,
        repetition_penalty=1.3,
        num_return_sequences=1,
    )

  # Response of the original model
  encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
  outputs = model.generate(
      input_ids=encoding.input_ids,
      attention_mask=encoding.attention_mask,  # passed to generate(), not to GenerationConfig
      generation_config=build_generation_config(tokenizer),
  )
  text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

  print(dashline)
  print(f'ORIGINAL MODEL RESPONSE:\n{text_output}')
  print(dashline)

  # Response of the PEFT fine-tuned model
  peft_encoding = peft_tokenizer(final_prompt, return_tensors="pt").to(device)
  peft_outputs = peft_model.generate(
      input_ids=peft_encoding.input_ids,
      attention_mask=peft_encoding.attention_mask,
      generation_config=build_generation_config(peft_tokenizer),
  )
  peft_text_output = peft_tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

  print(f'PEFT MODEL RESPONSE:\n{peft_text_output}')
  print(dashline)

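The inference function runs both the original sharded model and the PEFT-tuned model so that the results are easy to compare. For response generation I set temperature to 0.4, top_p to 0.6, and repetition_penalty to 1.3. If the responses are poor or the model seems to hallucinate, try adjusting these hyperparameters.
temperature controls how creative the generated text is. A temperature of 1 makes the model more creative, while a temperature close to 0 makes it more focused and deterministic.
top_p, also known as nucleus sampling, controls the range of tokens the model considers based on their cumulative probability. A lower top_p means the model samples only from the smallest set of most likely tokens whose cumulative probability reaches top_p. A higher top_p widens that set to include lower-probability tokens, and a top_p of 1 considers all tokens.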
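
The effect of these two parameters can be illustrated on a toy distribution. This is a minimal pure-Python sketch of temperature scaling and nucleus filtering, not the generation code used by the library; the logits and function names are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax; lower values sharpen the distribution."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
focused = top_p_filter(softmax_with_temperature(logits, 0.4), 0.6)
diverse = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)

print("temperature 0.4, top_p 0.6 keeps tokens:", sorted(focused))
print("temperature 1.0, top_p 0.9 keeps tokens:", sorted(diverse))
```

With the settings from this post (temperature 0.4, top_p 0.6), the toy distribution collapses onto its most likely token, while the looser settings leave several candidates to sample from.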
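Compared with the original Falcon-7B model, the PEFT model appears to hallucinate less and to generate more coherent responses. Below are the sample responses to the user question: How can I prevent anxiety and depression?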

-------------------------------------------------
ORIGINAL MODEL RESPONSE:
Answer the following question truthfully. 
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
  : 'You can prevent anxiety and depression by taking a walk in the park, listening to music, reading books, watching movies, playing games, and by doing yoga and meditation.'
: What are the symptoms of anxiety and depression?
  : 'The symptoms of anxiety and depression are as follows:
  * Anxiety: Feeling of fear, nervousness, and restlessness.
  * Depression: Feeling of sadness, hopelessness, and helplessness.
: What are the causes of anxiety and depression?
  : 'The causes of anxiety and depression are as follows:
  * Anxiety: Stressful life events, such as death of a loved one, divorce, loss of job, and so on.
  * Depression: Loss of a loved one, divorce, loss of job, and so on.
: What are the treatments for anxiety and depression?
  : 'The treatments for anxiety and depression are as follows:
  * Anxiety: Medication, psychotherapy, and counselling.
  * Depression: Medication, psychotherapy, and counselling.'
: What are the symptoms of schizophrenia?
  : 'The symptoms of schizophrenia are as follows
-------------------------------------------------
PEFT MODEL RESPONSE:
Answer the following question truthfully. 
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
  :  There are several steps you can take to manage your anxiety and avoid depressive symptoms:
    1. Get Regular Exercise: Engaging in regular physical activity can positively impact your mood. Not only does exercise release endorphins that boost your mood, but it also helps reduce stress levels. Win-win.
    2. Practice Relaxation Techniques: Whether it's deep breathing exercises, meditation, or yoga, find techniques that work for you and make time to practice them regularly. This can help you reduce tension and improve overall mental well-being.
    3. Maintain Healthy Eating Habits: A balanced diet is essential for all around health, including mental wellness. Avoid excessive sugar intake and try to incorporate fruits, vegetables, lean proteins, and healthy fats into your meals.
    4. Obtain Enough Sleep: While sleep recommendations vary by person, most individuals need between seven and nine hours of rest per night. Lack of sleep can exacerbate anxiety and depression symptoms.
    5. Limit Alcohol Consumption: Although alcohol can seem to relax you at first, its effects are usually short-lived and can worsen anxiety over time. Reduce or eliminate alcoholic drinks to lower your risk of experiencing heightened anxious feelings.
    6. Manage Stress: Find ways to effectively cope with stress
-------------------------------------------------

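As you can see, the original Falcon-7B model tends to produce many nonsensical speaker tags and extra question-and-answer turns instead of a coherent and meaningful response. The PEFT model, on the other hand, generates a meaningful response that matches the user's question.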

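ChatBot Demo Using Gradio
I created a demo notebook that shows the chatbot in action using Gradio [7]. It uses Gradio's Chatbot() interface and keeps a memory of up to 2 conversation turns. I also use a custom post_process_chat() function to handle model responses that contain incomplete sentences or hallucinated text. Here is sample Gradio code using Gradio Blocks.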

import gradio as gr

# init_llm_chain, user, and bot are helper functions that are not shown in this snippet.
with gr.Blocks() as demo:
    gr.HTML("""Welcome to Mental Health Conversational AI
""")
    gr.Markdown(
        """Chatbot specifically designed to provide psychoeducation, offer non-judgemental and empathetic support, self-assessment and monitoring.

        Get instant response for any mental health related queries. If the chatbot seems you need external support, then it will respond appropriately.
"""
    )

    chatbot = gr.Chatbot()
    query = gr.Textbox(label="Type your query here, then press 'enter' and scroll up for response")
    # size is set in the constructor; newer Gradio releases no longer provide the .style() method
    clear = gr.Button(value="Clear Chat History!", size="sm")

    llm_chain = init_llm_chain(peft_model, peft_tokenizer)

    # On submit, record the user message first, then let the bot add its reply
    query.submit(user, [query, chatbot], [query, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue().launch()
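
The post-processing and the 2-turn memory mentioned above are not shown in the snippet. The sketch below is a hypothetical version of both ideas and is not the code from the demo notebook: post_process_chat here only cuts a response back to its last complete sentence, and trim_history is an invented helper that keeps the most recent exchanges.

```python
import re

def post_process_chat(text):
    """Hypothetical cleanup: drop a trailing sentence that was cut off mid-way."""
    text = text.strip()
    # Sentence-ending punctuation followed by whitespace or the end of the text
    endings = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not endings:
        return text  # nothing that looks like a complete sentence; leave as-is
    return text[: endings[-1].end()]

def trim_history(history, max_turns=2):
    """Keep only the most recent (user, bot) exchanges as conversation memory."""
    return history[-max_turns:]

raw = "Get regular exercise. Manage stress: find ways to"
print(post_process_chat(raw))

history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
print(trim_history(history))
```

Trimming to the last sentence boundary addresses responses that stop at the max_new_tokens limit, such as the PEFT response above that ends mid-sentence; detecting hallucinated text would need additional checks.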