# Flagembedding > .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvalArgs --- Arguments ========= .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvalArgs .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvalModelArgs --- dataset loader ============== .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader Methods ------- .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.available_dataset_names .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.available_splits .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.check_dataset_names .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.check_splits .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.load_corpus .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.load_qrels .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader.load_queries .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_remote_corpus .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_remote_qrels .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_remote_queries .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_local_corpus .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_local_qrels .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._load_local_queries .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._download_file .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._get_fpath_size .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._download_gz_file .. automethod:: FlagEmbedding.abc.evaluation.AbsEvalDataLoader._download_zip_file --- Evaluator ========= .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvaluator --- runner ====== .. autoclass:: FlagEmbedding.abc.evaluation.AbsEvalRunner --- ======== searcher ======== EvalRetriever ============= .. autoclass:: FlagEmbedding.abc.evaluation.EvalRetriever EvalDenseRetriever ================== .. autoclass:: FlagEmbedding.abc.evaluation.EvalDenseRetriever EvalReranker ============ .. autoclass:: FlagEmbedding.abc.evaluation.EvalReranker --- Evaluation ========== .. toctree:: evaluation/arguments evaluation/data_loader evaluation/searcher evaluation/evaluator evaluation/runner --- AbsArguments ============ .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModelArguments .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerDataArguments --- ========== AbsDataset ========== AbsEmbedderTrainDataset ======================= .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderTrainDataset Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderTrainDataset._load_dataset .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderTrainDataset._shuffle_text AbsEmbedderCollator =================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderCollator AbsEmbedderSameDatasetTrainDataset ================================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset.refresh_epoch .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset._load_dataset .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset._get_file_batch_size .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset._get_train_group_size .. 
automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetTrainDataset._create_batch_data AbsEmbedderSameDatasetCollator ============================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderSameDatasetCollator EmbedderTrainerCallbackForDataRefresh ===================================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.EmbedderTrainerCallbackForDataRefresh Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.EmbedderTrainerCallbackForDataRefresh.on_epoch_end --- =========== AbsModeling =========== AbsEmbedderModel ================ .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.encode .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.compute_loss .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.compute_score .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.save .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.get_local_score .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.compute_local_score .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.forward .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel.distill_loss .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel._compute_no_in_batch_neg_loss .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel._compute_in_batch_neg_loss .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel._compute_cross_device_neg_loss .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModel._dist_gather_tensor EmbedderOutput ============== .. autoclass:: FlagEmbedding.abc.finetune.embedder.EmbedderOutput --- ========= AbsRunner ========= AbsEmbedderTrainer ================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner.load_tokenizer_and_model .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner.load_trainer .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner.load_train_dataset .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner.load_data_collator .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderRunner.run --- ========== AbsTrainer ========== AbsEmbedderTrainer ================== .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderTrainer Methods ------- .. automethod:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderTrainer.compute_loss --- Embedder ======== .. toctree:: embedder/AbsArguments embedder/AbsDataset embedder/AbsModeling embedder/AbsTrainer embedder/AbsRunner --- AbsArguments ============ .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderModelArguments .. autoclass:: FlagEmbedding.abc.finetune.embedder.AbsEmbedderDataArguments --- ========== AbsDataset ========== AbsRerankerTrainDataset ======================= .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainDataset Methods ------- .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainDataset.create_one_example .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainDataset._load_dataset .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainDataset._shuffle_text AbsRerankerCollator =================== .. 
autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerCollator AbsLLMRerankerTrainDataset ========================== .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsLLMRerankerTrainDataset AbsLLMRerankerCollator ====================== .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsLLMRerankerCollator --- =========== AbsModeling =========== AbsRerankerModel ================ .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel Methods ------- .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.encode .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.gradient_checkpointing_enable .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.enable_input_require_grads .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.forward .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.compute_loss .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.save .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerModel.save_pretrained RerankerOutput ============== .. autoclass:: FlagEmbedding.abc.finetune.reranker.RerankerOutput --- ========= AbsRunner ========= AbsRerankerTrainer ================== .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner Methods ------- .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner.load_tokenizer_and_model .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner.load_trainer .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner.load_train_dataset .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner.load_data_collator .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerRunner.run --- ========== AbsTrainer ========== AbsRerankerTrainer ================== .. autoclass:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainer Methods ------- .. automethod:: FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainer.compute_loss --- Reranker ======== .. toctree:: reranker/AbsArguments reranker/AbsDataset reranker/AbsModeling reranker/AbsTrainer reranker/AbsRunner --- Finetune ======== .. toctree:: finetune/embedder finetune/reranker --- AbsEmbedder =========== .. autoclass:: FlagEmbedding.abc.inference.AbsEmbedder Methods ------- .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.get_target_devices .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.get_detailed_instruct .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.encode_queries .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.encode_corpus .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.encode .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.encode_single_device .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.start_multi_process_pool .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder._encode_multi_process_worker .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.stop_multi_process_pool .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder.encode_multi_process .. automethod:: FlagEmbedding.abc.inference.AbsEmbedder._concatenate_results_from_multi_process --- AbsReranker =========== .. autoclass:: FlagEmbedding.abc.inference.AbsReranker Methods ------- .. automethod:: FlagEmbedding.abc.inference.AbsReranker.get_target_devices .. automethod:: FlagEmbedding.abc.inference.AbsReranker.get_detailed_instruct .. automethod:: FlagEmbedding.abc.inference.AbsReranker.get_detailed_inputs .. 
automethod:: FlagEmbedding.abc.inference.AbsReranker.compute_score
.. automethod:: FlagEmbedding.abc.inference.AbsReranker.compute_score_single_gpu
.. automethod:: FlagEmbedding.abc.inference.AbsReranker.start_multi_process_pool
.. automethod:: FlagEmbedding.abc.inference.AbsReranker.encode_multi_process
.. automethod:: FlagEmbedding.abc.inference.AbsReranker._encode_multi_process_worker
.. automethod:: FlagEmbedding.abc.inference.AbsReranker.stop_multi_process_pool

---

Inference
=========

.. toctree::

    inference/AbsEmbedder
    inference/AbsReranker

---

Abstract Class
==============

.. toctree::

    abc/inference
    abc/evaluation
    abc/finetune

---

arguments
=========

.. autoclass:: FlagEmbedding.evaluation.air_bench.AIRBenchEvalModelArgs

---

runner
======

.. autoclass:: FlagEmbedding.evaluation.air_bench.AIRBenchEvalRunner

---

AIR-Bench
=========

`AIR-Bench `_ (Automated heterogeneous Information Retrieval Benchmark) is a dynamic (actively updated) benchmark for information retrieval. The benchmark currently contains two versions. Notice that the testing data is generated by LLMs without human intervention. This makes it easier and faster to add new domains to the evaluation, and it also makes it impossible for any model to have the test data covered in its training set.

You can evaluate a model's performance on AIR-Bench by running our provided shell script:

.. code:: bash

    chmod +x /examples/evaluation/air_bench/eval_air_bench.sh
    ./examples/evaluation/air_bench/eval_air_bench.sh

Or by running:

.. code:: bash

    python -m FlagEmbedding.evaluation.air_bench \
        --benchmark_version AIR-Bench_24.05 \
        --task_types qa long-doc \
        --domains arxiv \
        --languages en \
        --splits dev test \
        --output_dir ./air_bench/search_results \
        --search_top_k 1000 \
        --rerank_top_k 100 \
        --cache_dir /root/.cache/huggingface/hub \
        --overwrite False \
        --embedder_name_or_path BAAI/bge-m3 \
        --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
        --devices cuda:0 cuda:1 \
        --model_cache_dir /root/.cache/huggingface/hub \
        --reranker_max_length 1024

Change the embedder, reranker, devices, and cache directory to your preference.

.. toctree::
    :hidden:

    airbench/arguments
    airbench/runner

---

arguments
=========

.. autoclass:: FlagEmbedding.evaluation.beir.arguments.BEIREvalArgs

---

data loader
===========

.. autoclass:: FlagEmbedding.evaluation.beir.data_loader.BEIREvalDataLoader

---

evaluator
=========

.. autoclass:: FlagEmbedding.evaluation.beir.evaluator.BEIREvaluator

---

runner
======

.. autoclass:: FlagEmbedding.evaluation.beir.BEIREvalRunner

---

BEIR
====

`BEIR `_ (Benchmarking-IR) is a heterogeneous evaluation benchmark for information retrieval. It is designed for evaluating the performance of NLP-based retrieval models and is widely used in research on modern embedding models.

You can evaluate a model's performance on the BEIR benchmark by running our provided shell script:

.. code:: bash

    chmod +x /examples/evaluation/beir/eval_beir.sh
    ./examples/evaluation/beir/eval_beir.sh

Or by running:
.. code:: bash

    python -m FlagEmbedding.evaluation.beir \
        --eval_name beir \
        --dataset_dir ./beir/data \
        --dataset_names fiqa arguana cqadupstack \
        --splits test dev \
        --corpus_embd_save_dir ./beir/corpus_embd \
        --output_dir ./beir/search_results \
        --search_top_k 1000 \
        --rerank_top_k 100 \
        --cache_path /root/.cache/huggingface/hub \
        --overwrite False \
        --k_values 10 100 \
        --eval_output_method markdown \
        --eval_output_path ./beir/beir_eval_results.md \
        --eval_metrics ndcg_at_10 recall_at_100 \
        --ignore_identical_ids True \
        --embedder_name_or_path BAAI/bge-large-en-v1.5 \
        --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
        --devices cuda:0 cuda:1 \
        --reranker_max_length 1024

Change the embedder, devices, and cache directory to your preference.

.. toctree::
    :hidden:

    beir/arguments
    beir/data_loader
    beir/evaluator
    beir/runner

---

data_loader
===========

.. autoclass:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader

Methods
-------

.. automethod:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader.available_dataset_names
.. automethod:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader.available_splits
.. automethod:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader._load_remote_corpus
.. automethod:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader._load_remote_qrels
.. automethod:: FlagEmbedding.evaluation.miracl.MIRACLEvalDataLoader._load_remote_queries

---

runner
======

.. autoclass:: FlagEmbedding.evaluation.miracl.MIRACLEvalRunner
    :members:

---

MIRACL
======

`MIRACL `_ (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages. They release a multilingual retrieval dataset containing train and dev sets for 16 "known languages" and only a dev set for 2 "surprise languages". The topics are generated by native speakers of each language, who also label the relevance between the topics and a given document list. You can find the `dataset `_ on Hugging Face.

You can evaluate a model's performance on MIRACL simply by running our provided shell script:

.. code:: bash

    chmod +x /examples/evaluation/miracl/eval_miracl.sh
    ./examples/evaluation/miracl/eval_miracl.sh

Or by running:

.. code:: bash

    python -m FlagEmbedding.evaluation.miracl \
        --eval_name miracl \
        --dataset_dir ./miracl/data \
        --dataset_names bn hi sw te th yo \
        --splits dev \
        --corpus_embd_save_dir ./miracl/corpus_embd \
        --output_dir ./miracl/search_results \
        --search_top_k 1000 \
        --rerank_top_k 100 \
        --cache_path /root/.cache/huggingface/hub \
        --overwrite False \
        --k_values 10 100 \
        --eval_output_method markdown \
        --eval_output_path ./miracl/miracl_eval_results.md \
        --eval_metrics ndcg_at_10 recall_at_100 \
        --embedder_name_or_path BAAI/bge-m3 \
        --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
        --devices cuda:0 cuda:1 \
        --cache_dir /root/.cache/huggingface/hub \
        --reranker_max_length 1024

Change the embedder, reranker, devices, and cache directory to your preference.

.. toctree::
    :hidden:

    miracl/data_loader
    miracl/runner

---

data_loader
===========

.. autoclass:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader

Methods
-------

.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader.available_dataset_names
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader.available_splits
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader.load_corpus
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader._load_local_qrels
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader._load_remote_corpus
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader._load_remote_qrels
.. automethod:: FlagEmbedding.evaluation.mkqa.MKQAEvalDataLoader._load_remote_queries

---

evaluator
=========

.. autoclass:: FlagEmbedding.evaluation.mkqa.MKQAEvaluator
    :members:

---

runner
======

.. autoclass:: FlagEmbedding.evaluation.mkqa.MKQAEvalRunner
    :members:

---

MKQA
====

`MKQA `_ is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages. The queries are sampled from the `Google Natural Questions dataset <https://github.com/google-research-datasets/natural-questions>`_. Each example in the dataset has the following structure:

.. code:: python

    {
        'example_id': 563260143484355911,
        'queries': {
            'en': "who sings i hear you knocking but you can't come in",
            'ru': "кто поет i hear you knocking but you can't come in",
            'ja': '「 I hear you knocking」は誰が歌っていますか',
            'zh_cn': "《i hear you knocking but you can't come in》是谁演唱的",
            ...
        },
        'query': "who sings i hear you knocking but you can't come in",
        'answers': {
            'en': [{
                'type': 'entity',
                'entity': 'Q545186',
                'text': 'Dave Edmunds',
                'aliases': [],
            }],
            'ru': [{
                'type': 'entity',
                'entity': 'Q545186',
                'text': 'Эдмундс, Дэйв',
                'aliases': ['Эдмундс', 'Дэйв Эдмундс', 'Эдмундс Дэйв', 'Dave Edmunds'],
            }],
            'ja': [{
                'type': 'entity',
                'entity': 'Q545186',
                'text': 'デイヴ・エドモンズ',
                'aliases': ['デーブ・エドモンズ', 'デイブ・エドモンズ'],
            }],
            'zh_cn': [{
                'type': 'entity',
                'text': '戴维·埃德蒙兹 ',
                'entity': 'Q545186',
            }],
            ...
        },
    }

You can evaluate a model's performance on MKQA simply by running our provided shell script:

.. code:: bash

    chmod +x /examples/evaluation/mkqa/eval_mkqa.sh
    ./examples/evaluation/mkqa/eval_mkqa.sh

Or by running:

.. code:: bash

    python -m FlagEmbedding.evaluation.mkqa \
        --eval_name mkqa \
        --dataset_dir ./mkqa/data \
        --dataset_names en zh_cn \
        --splits test \
        --corpus_embd_save_dir ./mkqa/corpus_embd \
        --output_dir ./mkqa/search_results \
        --search_top_k 1000 \
        --rerank_top_k 100 \
        --cache_path /root/.cache/huggingface/hub \
        --overwrite False \
        --k_values 20 \
        --eval_output_method markdown \
        --eval_output_path ./mkqa/mkqa_eval_results.md \
        --eval_metrics qa_recall_at_20 \
        --embedder_name_or_path BAAI/bge-m3 \
        --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
        --devices cuda:0 cuda:1 \
        --cache_dir /root/.cache/huggingface/hub \
        --reranker_max_length 1024

Change the embedder, reranker, devices, and cache directory to your preference.

.. toctree::
    :hidden:

    mkqa/data_loader
    mkqa/evaluator
    mkqa/runner

---

data_loader
===========

.. autoclass:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader

Methods
-------

.. automethod:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader.available_dataset_names
.. automethod:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader.available_splits
.. automethod:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader._load_remote_corpus
.. automethod:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader._load_remote_qrels
.. automethod:: FlagEmbedding.evaluation.mldr.MLDREvalDataLoader._load_remote_queries

---

runner
======

.. autoclass:: FlagEmbedding.evaluation.mldr.MLDREvalRunner
    :members:

---

MLDR
====

`MLDR `_ is a Multilingual Long-Document Retrieval dataset built on Wikipedia, Wudao and mC4, covering 13 typologically diverse languages. Specifically, we sample lengthy articles from the Wikipedia, Wudao and mC4 datasets and randomly choose paragraphs from them.
Then we use GPT-3.5 to generate questions based on these paragraphs. The generated question and the sampled article constitute a new text pair to the dataset. An example of ``train`` set looks like: .. code:: bash { 'query_id': 'q-zh-<...>', 'query': '...', 'positive_passages': [ { 'docid': 'doc-zh-<...>', 'text': '...' } ], 'negative_passages': [ { 'docid': 'doc-zh-<...>', 'text': '...' }, ... ] } An example of ``dev`` and ``test`` set looks like: .. code:: bash { 'query_id': 'q-zh-<...>', 'query': '...', 'positive_passages': [ { 'docid': 'doc-zh-<...>', 'text': '...' } ], 'negative_passages': [] } An example of ``corpus`` looks like: .. code:: bash { 'docid': 'doc-zh-<...>', 'text': '...' } You can evaluate model's performance on MLDR simply by running our provided shell script: .. code:: bash chmod +x /examples/evaluation/mldr/eval_mldr.sh ./examples/evaluation/mldr/eval_mldr.sh Or by running: .. code:: bash python -m FlagEmbedding.evaluation.mldr \ --eval_name mldr \ --dataset_dir ./mldr/data \ --dataset_names hi \ --splits test \ --corpus_embd_save_dir ./mldr/corpus_embd \ --output_dir ./mldr/search_results \ --search_top_k 1000 \ --rerank_top_k 100 \ --cache_path /root/.cache/huggingface/hub \ --overwrite False \ --k_values 10 100 \ --eval_output_method markdown \ --eval_output_path ./mldr/mldr_eval_results.md \ --eval_metrics ndcg_at_10 \ --embedder_name_or_path BAAI/bge-m3 \ --reranker_name_or_path BAAI/bge-reranker-v2-m3 \ --devices cuda:0 cuda:1 \ --cache_dir /root/.cache/huggingface/hub \ --embedder_passage_max_length 8192 \ --reranker_max_length 8192 change the args of embedder, reranker, devices and cache directory to your preference. .. toctree:: :hidden: mldr/data_loader mldr/runner --- data_loader =========== .. autoclass:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader Methods ------- .. automethod:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader.available_dataset_names .. automethod:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader.available_splits .. automethod:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader._load_remote_corpus .. automethod:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader._load_remote_qrels .. automethod:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalDataLoader._load_remote_queries --- runner ====== .. autoclass:: FlagEmbedding.evaluation.msmarco.MSMARCOEvalRunner :members: --- MSMARCO ======= `MS Marco `_ (Microsoft MAchine Reading Comprehension) is a large scale real-world reading comprehension dataset. It is widely used in information retrieval, question answering, and natural language processing research. You can evaluate model's performance on MS MARCO simply by running our provided shell script: .. code:: bash chmod +x /examples/evaluation/msmarco/eval_msmarco.sh ./examples/evaluation/msmarco/eval_msmarco.sh Or by running: .. 
code:: bash python -m FlagEmbedding.evaluation.msmarco \ --eval_name msmarco \ --dataset_dir ./msmarco/data \ --dataset_names passage \ --splits dev \ --corpus_embd_save_dir ./msmarco/corpus_embd \ --output_dir ./msmarco/search_results \ --search_top_k 1000 \ --rerank_top_k 100 \ --cache_path /root/.cache/huggingface/hub \ --overwrite True \ --k_values 10 100 \ --eval_output_method markdown \ --eval_output_path ./msmarco/msmarco_eval_results.md \ --eval_metrics ndcg_at_10 recall_at_100 \ --embedder_name_or_path BAAI/bge-large-en-v1.5 \ --reranker_name_or_path BAAI/bge-reranker-v2-m3 \ --devices cuda:0 cuda:1 cuda:2 cuda:3 cuda:4 cuda:5 cuda:6 cuda:7 \ --cache_dir /root/.cache/huggingface/hub \ --reranker_max_length 1024 change the embedder, reranker, devices and cache directory to your preference. .. toctree:: :hidden: msmarco/data_loader msmarco/runner --- arguments ========= .. autoclass:: FlagEmbedding.evaluation.mteb.arguments.MTEBEvalArgs --- runner ====== .. autoclass:: FlagEmbedding.evaluation.mteb.runner.MTEBEvalRunner --- searcher ======== .. autoclass:: FlagEmbedding.evaluation.mteb.searcher.MTEBEvalDenseRetriever .. autoclass:: FlagEmbedding.evaluation.mteb.searcher.MTEBEvalReranker --- MTEB ==== `MTEB `_ (The Massive Text Embedding Benchmark) is a large-scale evaluation framework designed to assess the performance of text embedding models across a wide variety of NLP tasks. Introduced to standardize and improve the evaluation of text embeddings, MTEB is crucial for assessing how well these models generalize across various real-world applications. It contains a wide range of datasets in eight main NLP tasks and different languages, and provides an easy pipeline for evaluation. It also holds the well known MTEB `leaderboard `_, which contains a ranking of the latest first-class embedding models. You can evaluate model's performance on the whole MTEB benchmark by running our provided shell script: .. code:: bash chmod +x /examples/evaluation/mteb/eval_mteb.sh ./examples/evaluation/mteb/eval_mteb.sh Or by running: .. code:: bash python -m FlagEmbedding.evaluation.mteb \ --eval_name mteb \ --output_dir ./mteb/search_results \ --languages eng \ --tasks NFCorpus BiorxivClusteringS2S SciDocsRR \ --eval_output_path ./mteb/mteb_eval_results.json \ --embedder_name_or_path BAAI/bge-large-en-v1.5 \ --devices cuda:7 \ --cache_dir /root/.cache/huggingface/hub change the embedder, devices and cache directory to your preference. .. toctree:: :hidden: mteb/arguments mteb/searcher mteb/runner --- Evaluation ========== .. toctree:: evaluation/mteb evaluation/airbench evaluation/msmarco evaluation/beir evaluation/miracl evaluation/mkqa evaluation/mldr --- Arguments ========= .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.base.DecoderOnlyEmbedderModelArguments --- ======== Modeling ======== .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.base.CrossDecoderModel Methods ======= .. automethod:: FlagEmbedding.finetune.reranker.decoder_only.base.CrossDecoderModel.encode --- Runner ====== .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.base.DecoderOnlyRerankerRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.base.DecoderOnlyRerankerTrainer :members: --- Base ==== .. toctree:: base/arguments base/modeling base/runner base/trainer --- Arguments ========= .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLModelArguments .. 
autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLDataArguments --- ======= Dataset ======= DecoderOnlyEmbedderICLSameDatasetTrainDataset ============================================= .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLSameDatasetTrainDataset Methods ------- .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLSameDatasetTrainDataset._create_batch_data AbsEmbedderSameDatasetCollator ============================== .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.AbsEmbedderSameDatasetCollator --- ======== Modeling ======== .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel Methods ======= .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.encode .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.compute_score .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.compute_loss .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.gradient_checkpointing_enable .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.enable_input_require_grads .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel.save .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel._sentence_embedding .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.icl.BiDecoderOnlyEmbedderICLModel._compute_similarity --- Runner ====== .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLTrainer :members: --- ICL === .. toctree:: icl/arguments icl/dataset icl/modeling icl/runner icl/trainer --- Decoder Only ============ .. toctree:: decoder_only/base decoder_only/icl --- Modeling ======== .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel Methods ------- .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.encode .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.compute_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.compute_loss .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.gradient_checkpointing_enable .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.enable_input_require_grads .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel.save .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel._sentence_embedding .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.base.BiEncoderOnlyEmbedderModel._compute_similarity --- Runner ====== .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.base.EncoderOnlyEmbedderRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.base.EncoderOnlyEmbedderTrainer :members: --- Base ==== .. toctree:: base/modeling base/runner base/trainer --- Arguments ========= .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3ModelArguments .. 
autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3TrainingArguments --- ======== Modeling ======== EncoderOnlyEmbedderM3Model ============================ .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model Methods ------- .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.encode .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.compute_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.compute_dense_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.compute_sparse_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.compute_colbert_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.ensemble_score .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.forward .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.compute_loss .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.gradient_checkpointing_enable .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.enable_input_require_grads .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model.save .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._dense_embedding .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._sparse_embedding .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._colbert_embedding .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._encode .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._compute_similarity .. automethod:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Model._get_queries_attention_mask EncoderOnlyEmbedderM3ModelForInference ====================================== .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3ModelForInference :members: --- Runner ====== .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Runner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.embedder.encoder_only.m3.EncoderOnlyEmbedderM3Trainer :members: --- M3 == .. toctree:: m3/arguments m3/modeling m3/runner m3/trainer --- Encoder Only ============ .. toctree:: encoder_only/base encoder_only/m3 --- Embedder ======== .. toctree:: embedder/encoder_only embedder/decoder_only --- Arguments ========= .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.base.RerankerModelArguments --- ======== Modeling ======== .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel Methods ======= .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.encode .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.compute_score .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.compute_loss .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.gradient_checkpointing_enable .. 
automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.enable_input_require_grads .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel.save .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel._sentence_embedding .. automethod:: FlagEmbedding.finetune.embedder.decoder_only.base.BiDecoderOnlyEmbedderModel._compute_similarity --- Runner ====== .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.base.DecoderOnlyEmbedderRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.embedder.decoder_only.base.DecoderOnlyEmbedderTrainer :members: --- Base ==== .. toctree:: base/arguments base/modeling base/runner base/trainer --- Arguments ========= .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.RerankerModelArguments --- ======== Modeling ======== .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.CrossDecoderModel Methods ======= .. automethod:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.CrossDecoderModel.encode .. automethod:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.CrossDecoderModel.forward --- Runner ====== .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.DecoderOnlyRerankerRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.reranker.decoder_only.layerwise.DecoderOnlyRerankerTrainer :members: --- Layerwise ========= .. toctree:: layerwise/arguments layerwise/modeling layerwise/runner layerwise/trainer --- Decoder Only ============ .. toctree:: decoder_only/base decoder_only/layerwise --- Modeling ======== .. autoclass:: FlagEmbedding.finetune.reranker.encoder_only.base.CrossEncoderModel Methods ------- .. automethod:: FlagEmbedding.finetune.reranker.encoder_only.base.CrossEncoderModel.encode --- Runner ====== .. autoclass:: FlagEmbedding.finetune.reranker.encoder_only.base.EncoderOnlyRerankerRunner :members: --- Trainer ======= .. autoclass:: FlagEmbedding.finetune.reranker.encoder_only.base.EncoderOnlyRerankerTrainer :members: --- Base ==== .. toctree:: base/modeling base/runner base/trainer --- Encoder Only ============ .. toctree:: encoder_only/base --- Reranker ======== .. toctree:: reranker/encoder_only reranker/decoder_only --- Finetune ======== .. toctree:: finetune/embedder finetune/reranker --- API === .. toctree:: :maxdepth: 1 abc inference evaluation finetune --- FlagAutoModel ============= .. autoclass:: FlagEmbedding.inference.FlagAutoModel Methods ------- .. automethod:: FlagEmbedding.inference.FlagAutoModel.from_finetuned --- FlagAutoReranker ================ .. autoclass:: FlagEmbedding.inference.FlagAutoReranker Methods ------- .. automethod:: FlagEmbedding.inference.FlagAutoReranker.from_finetuned --- BaseEmbedder ============ .. autoclass:: FlagEmbedding.inference.embedder.decoder_only.base.BaseLLMEmbedder Methods ------- .. automethod:: FlagEmbedding.inference.embedder.decoder_only.base.BaseLLMEmbedder.encode_queries .. automethod:: FlagEmbedding.inference.embedder.decoder_only.base.BaseLLMEmbedder.encode_corpus .. automethod:: FlagEmbedding.inference.embedder.decoder_only.base.BaseLLMEmbedder.encode .. automethod:: FlagEmbedding.inference.embedder.decoder_only.base.BaseLLMEmbedder.encode_single_device --- ICLLLMEmbedder ============== .. autoclass:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder Methods ------- .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.encode_queries .. 
automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.encode_corpus .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.encode .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.set_examples .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.get_detailed_example .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.encode_queries_single_device .. automethod:: FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder.encode_single_device --- Embedder ======== .. toctree:: encoder_only/BaseEmbedder encoder_only/M3Embedder decoder_only/BaseLLMEmbedder decoder_only/ICLLLMEmbedder --- BaseEmbedder ============ .. autoclass:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder Methods ------- .. automethod:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder.encode_queries :no-index: .. automethod:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder.encode_corpus .. automethod:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder.encode .. automethod:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder.encode_single_device .. automethod:: FlagEmbedding.inference.embedder.encoder_only.base.BaseEmbedder.pooling --- M3Embedder ============ .. autoclass:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder Methods ------- .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.encode_queries .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.encode_corpus .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.encode .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.convert_id_to_token .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.compute_lexical_matching_score .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.colbert_score .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.encode_single_device .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.compute_score .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.compute_score_multi_process .. automethod:: FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder.compute_score_single_device --- BaseLLMReranker =============== .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.base.BaseLLMReranker Methods ------- .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.base.BaseLLMReranker.compute_score_single_gpu --- LayerWiseLLMReranker ==================== .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.layerwise.LayerWiseLLMReranker Methods ------- .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.layerwise.LayerWiseLLMReranker.compute_score_single_gpu --- LightweightLLMReranker ====================== .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.lightweight.LightweightLLMReranker Methods ------- .. autoclass:: FlagEmbedding.inference.reranker.decoder_only.lightweight.LightweightLLMReranker.compute_score_single_gpu --- BaseReranker ============ .. autoclass:: FlagEmbedding.inference.reranker.encoder_only.base.BaseReranker Methods ------- .. autoclass:: FlagEmbedding.inference.reranker.encoder_only.base.BaseReranker.compute_score_single_gpu --- Reranker ======== .. 
toctree::

    encoder_only/BaseReranker
    decoder_only/BaseLLMReranker
    decoder_only/LayerWiseLLMReranker
    decoder_only/LightweightLLMReranker

---

Inference
=========

.. toctree::

    inference/FlagAutoModel
    inference/FlagAutoReranker
    inference/embedder/embedder
    inference/reranker/reranker

---

.. C-MTEB
.. ======
.. Introduction
.. ------------
.. `C-MTEB `_ is a benchmark for chinese text embedding. It contains 35
.. datasets in 6 different tasks, providing a comprehensive evaluation to the quality of an embedding model on Chinese.
.. .. image:: ../_static/img/C_MTEB.png
..     :width: 700
..     :align: center
.. Installation
.. ------------
.. C-MTEB is developed based on MTEB, you can install C-MTEB by:
.. .. code:: bash
..     pip install -U C_MTEB
.. or install by FlagEmbedding's repo:
.. .. code:: bash
..     git clone https://github.com/FlagOpen/FlagEmbedding.git
..     cd FlagEmbedding/C_MTEB
..     pip install -e .
.. Citing the Work
.. ---------------
.. There are more details in our publication. If you find C-MTEB useful, you can cite it by:
.. .. code::
..     @misc{c-pack,
..       title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
..       author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
..       year={2023},
..       eprint={2309.07597},
..       archivePrefix={arXiv},
..       primaryClass={cs.CL}
..     }

---

FAQ
===

Below are some commonly asked questions.

.. tip::

    For more questions, search in issues on GitHub or join our community!

.. dropdown:: Having network issues when connecting to Hugging Face?
    :animate: fade-in-slide-down

    Try setting :code:`HF_ENDPOINT` to the `HF mirror `_ instead.

    .. code:: bash

        export HF_ENDPOINT=https://hf-mirror.com

.. dropdown:: When does the query instruction need to be used?
    :animate: fade-in-slide-down

    For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best way to decide whether to add instructions for queries is to choose the setting that achieves better performance on your task. In all cases, the documents/passages do not need the instruction.

.. dropdown:: Why does it take so long to encode just one sentence?
    :animate: fade-in-slide-down

    Note that if you have multiple CUDA GPUs, FlagEmbedding will automatically use all of them. The time spent starting the multiple processes can then far exceed the actual encoding time. Try using only the CPU or a single GPU for simple tasks.

.. dropdown:: Why are the embedding results different on CPU and GPU?
    :animate: fade-in-slide-down

    The encode function will use FP16 by default if a GPU is available, which leads to different precision. Set :code:`use_fp16=False` to get full precision.

.. dropdown:: How many languages do the multi-lingual models support?
    :animate: fade-in-slide-down

    The training datasets cover up to 170+ languages. But note that, due to the unbalanced distribution of languages, performance will differ between languages. Please test further on your real application scenario.

.. dropdown:: How do the different retrieval methods work in bge-m3?
    :animate: fade-in-slide-down

    - Dense retrieval: map the text into a single embedding, e.g., `DPR `_, `BGE-v1.5 <../bge/bge_v1_v1.5>`_
    - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, `unicoil `_, and `splade `_
    - Multi-vector retrieval: use multiple vectors to represent a text, e.g., `ColBERT `_.

.. dropdown:: Recommended vector database?
    :animate: fade-in-slide-down

    Generally you can use any vector database (open-source or commercial). We use `Faiss `_ by default in our evaluation pipeline and tutorials.
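    As a minimal, illustrative sketch (the model name and toy corpus here are only placeholders, not part of the evaluation pipeline), a flat inner-product Faiss index over BGE embeddings can be built like this:

    .. code:: python

        import faiss
        import numpy as np
        from FlagEmbedding import FlagAutoModel

        model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
        corpus = [
            "BGE is a series of embedding models released by BAAI.",
            "Faiss is a library for efficient similarity search over dense vectors.",
        ]

        # BGE embeddings are normalized by default, so inner product behaves like cosine similarity
        corpus_embeddings = np.asarray(model.encode(corpus), dtype=np.float32)
        index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
        index.add(corpus_embeddings)

        query_embedding = np.asarray(model.encode(["what is BGE?"]), dtype=np.float32)
        scores, ids = index.search(query_embedding, 2)  # top-2 corpus entries per query
        print(scores, ids)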
.. dropdown:: Not enough VRAM or OOM error during evaluation?
    :animate: fade-in-slide-down

    The default values of :code:`embedder_batch_size` and :code:`reranker_batch_size` are both 3000. Try a smaller value.

---

Information Retrieval
=====================

What is Information Retrieval?
------------------------------

Simply put, Information Retrieval (IR) is the science of searching and retrieving information from a large collection of data based on a user's query. The goal of an IR system is not just to return a list of documents but to ensure that the most relevant ones appear at the top of the results.

A very straightforward example of IR is a library catalog. One wants to find the book that best matches the query, but there are thousands or millions of books on the shelf. The library's catalog system helps you find the best matches based on your search terms. In the modern digital world, search engines and databases work in a similar way, using sophisticated algorithms and models to retrieve, rank, and return the most relevant results. And the resource categories are expanding from text to more modalities such as images, videos, 3D objects, music, etc.

IR and Embedding Model
----------------------

Traditional IR methods, like TF-IDF and BM25, rely on statistical and heuristic techniques to rank documents based on term frequency and document relevance. These methods are efficient and effective for keyword-based search but often struggle with understanding the deeper context or semantics of the text.

.. seealso::

    Take a very simple example with two sentences:

    .. code:: python

        sentence_1 = "watch a play"
        sentence_2 = "play with a watch"

    Sentence 1 means going to a show/performance, with "watch" as a verb and "play" as a noun. Sentence 2, however, means someone is interacting with a timepiece on the wrist, with "play" as a verb and "watch" as a noun. These two sentences could be regarded as very similar by traditional IR methods, even though they have totally different semantic meanings.

Then how could we solve this? The best answer so far is embedding models. Embedding models have revolutionized IR by representing text as dense vectors in a high-dimensional space, capturing the semantic meaning of words, sentences, or even entire documents. This allows for more sophisticated search capabilities, such as semantic search, where results are ranked based on meaning rather than simple keyword matching.
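The sketch below is purely illustrative (the checkpoint name is just an example): it scores the two sentences above with a simple bag-of-words overlap and with a dense embedding model, to show that keyword overlap alone cannot tell them apart, while embeddings compare them by meaning rather than by shared tokens.

.. code:: python

    from FlagEmbedding import FlagAutoModel

    sentence_1 = "watch a play"
    sentence_2 = "play with a watch"

    # Keyword view: the two token sets overlap almost completely
    tokens_1, tokens_2 = set(sentence_1.split()), set(sentence_2.split())
    jaccard = len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)
    print(f"Token overlap (Jaccard): {jaccard:.2f}")

    # Semantic view: dense embeddings are compared by meaning
    model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
    embeddings = model.encode([sentence_1, sentence_2])
    print("Embedding similarity:", embeddings[0] @ embeddings[1])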
---

Embedder
========

.. tip::

    If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!

An embedder, also called an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space. These embeddings capture the semantic meaning or key features of the input, which enables efficient comparison and analysis.

A very famous demonstration is the example from `word2vec `_. It shows how word embeddings capture semantic relationships through vector arithmetic:

.. image:: ../_static/img/word2vec.png
    :width: 500
    :align: center

Nowadays, embedders are capable of mapping sentences and even passages into vector space. They are widely used in real-world tasks such as retrieval, clustering, etc. In the era of LLMs, embedding models play a pivotal role in RAG, enabling LLMs to access and integrate relevant context from vast external datasets.

Sparse Vector
-------------

Sparse vectors usually have a high-dimensional structure with only a few non-zero values, which is effective for tasks like keyword matching. Typically, though not always, the number of dimensions in sparse vectors corresponds to the different tokens present in the language. Each dimension is assigned a value representing the token's relative importance within the document. Some well-known algorithms for sparse vector embedding include `bag-of-words `_, `TF-IDF `_, `BM25 `_, etc. Sparse vector embeddings are good at capturing key terms and their corresponding importance within documents.

Dense Vector
------------

Dense vectors typically use neural networks to map words, sentences, and passages into a fixed-dimension latent vector space. We can then compare the similarity between two objects using metrics like Euclidean distance or cosine similarity. Such vectors can represent the deeper meaning of sentences, so we can distinguish sentences that use similar words but mean different things, and recognize different ways of speaking and writing that express the same thing. Dense vector embeddings, instead of counting and matching keywords, directly capture the semantics.

---

Introduction
============

BGE provides a one-stop retrieval toolkit for search and RAG. We provide inference, evaluation, and fine-tuning for embedding models and rerankers.

.. figure:: ../_static/img/RAG_pipeline.png
    :width: 700
    :align: center

    BGE embedder and reranker in a RAG pipeline. `Source `_

Quickly get started with:

.. toctree::
    :maxdepth: 1
    :caption: Start

    overview
    installation
    quick_start

.. toctree::
    :maxdepth: 1
    :caption: Concept

    IR
    embedder
    reranker
    similarity
    retrieval_demo

---

:github_url: https://github.com/FlagOpen/FlagEmbedding

Installation
============

Using pip:
----------

If you do not need to finetune the models, you can install the package without the finetune dependency:

.. code:: bash

    pip install -U FlagEmbedding

If you want to finetune the models, you can install the package with the finetune dependency:

.. code:: bash

    pip install -U FlagEmbedding[finetune]

Install from sources:
---------------------

Clone the repository and install:

.. code:: bash

    git clone https://github.com/FlagOpen/FlagEmbedding.git
    cd FlagEmbedding
    # If you do not need to finetune the models, you can install the package without the finetune dependency:
    pip install .
    # If you want to finetune the models, install the package with the finetune dependency:
    pip install .[finetune]

For development in editable mode:

.. code:: bash

    # If you do not need to finetune the models, you can install the package without the finetune dependency:
    pip install -e .
    # If you want to finetune the models, install the package with the finetune dependency:
    pip install -e .[finetune]

PyTorch-CUDA
------------

If you want to use CUDA GPUs during inference and finetuning, please install an appropriate version of `PyTorch `_ with CUDA support.

---

Overview
========

Our repository provides well-structured `APIs `_ for the inference, evaluation, and fine-tuning of BGE series models. Besides that, there are abundant resources for users to quickly get hands-on experience.

.. figure:: https://raw.githubusercontent.com/FlagOpen/FlagEmbedding/refs/heads/master/imgs/projects.png
    :width: 700
    :align: center

    Structure of contents in our `repo `_

Our repository provides well-structured content for information retrieval and RAG:

- The core `APIs <../API>`_ for embedding models' inference, evaluation, and fine-tuning.
- Hands-on `examples `_ for the three mentioned use cases.
- Detailed `tutorials `_ covering topics in retrieval to help you learn from scratch.

---

Quick Start
===========

First, load one of the BGE embedding models:

.. code:: python

    from FlagEmbedding import FlagAutoModel

    model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')

.. tip::

    If there's difficulty connecting to Hugging Face, you can use the `HF mirror `_ instead.

    .. code:: bash

        export HF_ENDPOINT=https://hf-mirror.com

Then, feed some sentences to the model and get their embeddings:

.. code:: python

    sentences_1 = ["I love NLP", "I love machine learning"]
    sentences_2 = ["I love BGE", "I love text retrieval"]
    embeddings_1 = model.encode(sentences_1)
    embeddings_2 = model.encode(sentences_2)

Once we get the embeddings, we can compute their similarity by inner product:

.. code:: python

    similarity = embeddings_1 @ embeddings_2.T
    print(similarity)

---

Reranker
========

.. tip::

    If you are already familiar with the concepts, take a look at the :doc:`BGE rerankers <../bge/index>`!

A reranker, or cross-encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them. Typically, we use an embedder as a bi-encoder: it first computes the embeddings of two input sentences and then computes their similarity using metrics such as cosine similarity or Euclidean distance. A reranker, in contrast, takes the two sentences at the same time and directly computes a score representing their similarity. The following figure shows the difference:

.. figure:: https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png
    :width: 500
    :align: center

    Bi-Encoder & Cross-Encoder (from Sentence Transformers)

Although a cross-encoder usually performs better than a bi-encoder, it is extremely time-consuming to use a cross-encoder when we have a large amount of data. Thus a widely accepted approach is to use a bi-encoder for initial retrieval (e.g., selecting the top 100 candidates from 100,000 sentences) and then refine the ranking of the selected candidates using a cross-encoder for more accurate results, as in the sketch below.
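As a minimal sketch of that retrieve-then-rerank pattern (the model names and the toy corpus are only placeholders), a bi-encoder first narrows the candidates and a cross-encoder then rescores them:

.. code:: python

    from FlagEmbedding import FlagAutoModel, FlagAutoReranker

    query = "What is the capital of France?"
    corpus = [
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower is located in Paris.",
        "Berlin is the capital of Germany.",
    ]

    # Stage 1: bi-encoder retrieval, cheap enough to score the whole corpus
    embedder = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
    query_emb = embedder.encode_queries([query])
    corpus_emb = embedder.encode_corpus(corpus)
    scores = (query_emb @ corpus_emb.T)[0]
    top_ids = scores.argsort()[::-1][:2]  # keep only the top candidates

    # Stage 2: cross-encoder reranking on the small candidate set
    reranker = FlagAutoReranker.from_finetuned('BAAI/bge-reranker-v2-m3')
    rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_ids])
    for idx, score in sorted(zip(top_ids, rerank_scores), key=lambda x: -x[1]):
        print(score, corpus[idx])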
---

Similarity
==========

A primary goal of retrieval is to find the most relevant documents in response to a user's query. One of the core components of this process is measuring the similarity between the query and the candidates. Similarity metrics quantify how closely related two pieces of data are, and guide the retrieval system in ranking results.

Jaccard Similarity
------------------

.. math::

    J(A,B)=\frac{|A\cap B|}{|A\cup B|}

The Jaccard similarity or Jaccard index is commonly used for set-based similarity, particularly on binary data (e.g., whether a term appears in a document or not). It is calculated as the size of the intersection of two sets divided by the size of their union. In information retrieval, it's often used to compare sets of keywords or phrases, with higher values indicating more similarity.

Euclidean Distance
------------------

.. math::

    d(A, B) = \|A-B\|_2 = \sqrt{\sum_{i=1}^n (A_i-B_i)^2}

Euclidean distance measures the straight-line distance between two points in a vector space. In IR, this can be used to assess the difference between document or query vectors. A smaller distance indicates greater similarity. This metric is intuitive but can sometimes be sensitive to the scale of the data, especially in high-dimensional spaces like text embeddings.

Cosine Similarity
-----------------

.. math::

    \cos(\theta)=\frac{A\cdot B}{\|A\|\|B\|}

Cosine similarity is one of the most widely used metrics in information retrieval, especially for text. It measures the cosine of the angle between two vectors in a multi-dimensional space (typically representing term frequency vectors of documents and queries). The closer the cosine similarity is to 1, the more similar the vectors are. A value of 0 indicates orthogonality, meaning no similarity. It's a simple yet effective measure for text-based retrieval, as it considers the orientation but not the magnitude of vectors.

Dot Product
-----------

Coordinate definition:

.. math::

    A\cdot B = \sum_{i=1}^{n}A_i B_i

Geometric definition:

.. math::

    A\cdot B = \|A\|\|B\|\cos(\theta)

The dot product between two vectors provides a measure of how similar the vectors are in terms of direction and magnitude. In information retrieval, the dot product is often used in vector space models, particularly when dealing with pre-trained word or sentence embeddings. A higher dot product indicates that the query and document are closely aligned in the vector space.
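The small NumPy sketch below is purely illustrative; it computes the four measures above for two toy vectors, following the formulas as written:

.. code:: python

    import numpy as np

    A = np.array([1.0, 2.0, 0.0, 3.0])
    B = np.array([2.0, 1.0, 0.0, 3.0])

    # Jaccard similarity over the sets of indices with non-zero entries
    set_a, set_b = set(np.nonzero(A)[0]), set(np.nonzero(B)[0])
    jaccard = len(set_a & set_b) / len(set_a | set_b)

    euclidean = np.linalg.norm(A - B)                       # straight-line distance
    dot = float(A @ B)                                      # coordinate definition
    cosine = dot / (np.linalg.norm(A) * np.linalg.norm(B))  # angle between the vectors

    print(f"Jaccard: {jaccard:.3f}, Euclidean: {euclidean:.3f}, "
          f"dot product: {dot:.3f}, cosine: {cosine:.3f}")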
---

BGE-Code-v1
===========

`BGE-Code-v1 `_ is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. It primarily demonstrates the following capabilities:

- Superior Code Retrieval Performance: The model demonstrates exceptional code retrieval capabilities, supporting natural language queries in both English and Chinese, as well as 20 programming languages.
- Robust Text Retrieval Capabilities: The model maintains strong text retrieval capabilities comparable to text embedding models of similar scale.
- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval capabilities, excelling in languages such as English, Chinese, Japanese, French, and more.

+-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+
| Model | Language | Parameters | Model Size | Description |
+===================================================================+=================+============+==============+====================================================================================================+
| `BAAI/bge-code-v1 `_ | Multilingual | 1.5B | 6.18 GB | SOTA code retrieval model, with exceptional multilingual text retrieval performance as well |
+-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+

..
code:: python from FlagEmbedding import FlagLLMModel queries = [ "Delete the record with ID 4 from the 'Staff' table.", 'Delete all records in the "Livestock" table where age is greater than 5' ] documents = [ "DELETE FROM Staff WHERE StaffID = 4;", "DELETE FROM Livestock WHERE age > 5;" ] model = FlagLLMModel('BAAI/bge-code-v1', query_instruction_format="{}\n{}", query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.", trust_remote_code=True, use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode_queries(queries) embeddings_2 = model.encode_corpus(documents) similarity = embeddings_1 @ embeddings_2.T print(similarity) --- BGE-EN-ICL ========== BGE-EN-ICL is the new SoTA embedding model in BGE series with capabilities: - In-context learning ability: By providing few-shot examples in the query, it can significantly enhance the model's ability to handle new tasks. - Outstanding performance: The model has achieved state-of-the-art (SOTA) performance on MTEB and AIR-Bench. +-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +===================================================================+=================+============+==============+====================================================================================================+ | `BAAI/bge-en-icl `_ | English | 7.1B | 28.5 GB | In-context learning capabilities, fully leverage the model's potential based on a few shot examples| +-------------------------------------------------------------------+-----------------+------------+--------------+----------------------------------------------------------------------------------------------------+ Usage ----- .. code:: python from FlagEmbedding import FlagICLModel documents = [ "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.", "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments." ] examples = [ { 'instruct': 'Given a web search query, retrieve relevant passages that answer the query.', 'query': 'what is a virtual interface', 'response': "A virtual interface is a software-defined abstraction that mimics the behavior and characteristics of a physical network interface. It allows multiple logical network connections to share the same physical network interface, enabling efficient utilization of network resources. Virtual interfaces are commonly used in virtualization technologies such as virtual machines and containers to provide network connectivity without requiring dedicated hardware. They facilitate flexible network configurations and help in isolating network traffic for security and management purposes." 
}, { 'instruct': 'Given a web search query, retrieve relevant passages that answer the query.', 'query': 'causes of back pain in female for a week', 'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management." } ] queries = ["how much protein should a female eat", "summit define"] model = FlagICLModel('BAAI/bge-en-icl', examples_for_task=examples, # set `examples_for_task=None` to use model without examples examples_instruction_format="{}\n{}\n{}") # specify the format to use examples_for_task embeddings_1 = model.encode_queries(queries) embeddings_2 = model.encode_corpus(documents) similarity = embeddings_1 @ embeddings_2.T print(similarity) --- ====== BGE-M3 ====== BGE-M3 is a compound and powerful embedding model distinguished for its versatility in: - **Multi-Functionality**: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval. - **Multi-Linguality**: It can support more than 100 working languages. - **Multi-Granularity**: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. +-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +===================================================================+=================+============+==============+=======================================================================+ | `BAAI/bge-m3 `_ | Multi-Lingual | 569M | 2.27 GB | Multi-Functionality, Multi-Linguality, and Multi-Granularity | +-------------------------------------------------------------------+-----------------+------------+--------------+-----------------------------------------------------------------------+ Multi-Linguality ================ BGE-M3 was trained on multiple datasets covering up to 170+ different languages. Since the amount of training data is highly unbalanced across languages, actual model performance also differs from language to language. For more information about the datasets and evaluation results, please check out our `paper `_. Multi-Granularity ================= We extend the maximum position to 8192, enabling the embedding of longer documents. We also propose a simple but effective method, MCLS (Multiple CLS), to enhance the model's ability on long text without additional fine-tuning. Multi-Functionality =================== ..
code:: python from FlagEmbedding import BGEM3FlagModel model = BGEM3FlagModel('BAAI/bge-m3') sentences_1 = ["What is BGE M3?", "Definition of BM25"] sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"] Dense Retrieval --------------- Similar to BGE v1 or v1.5 models, BGE-M3 uses the normalized hidden state of the special token [CLS] as the dense embedding: .. math:: e_q = norm(H_q[0]) Next, to compute the relevance score between the query and passage: .. math:: s_{dense}=f_{sim}(e_p, e_q) where :math:`e_p, e_q` are the embedding vectors of passage and query, respectively, and :math:`f_{sim}` is the score function (such as inner product or L2 distance) for computing the two embeddings' similarity. Sparse Retrieval ---------------- BGE-M3 generates sparse embeddings by adding a linear layer and a ReLU activation function on top of the hidden states: .. math:: w_{qt} = \text{Relu}(W_{lex}^T H_q [i]) where :math:`W_{lex}` represents the weights of the linear layer and :math:`H_q[i]` is the encoder's output for the :math:`i^{th}` token. Based on the token weights of the query and passage, the relevance score between them is computed from the joint importance of the co-existing terms within the query and passage: .. math:: s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt}) where :math:`w_{qt}, w_{pt}` are the importance weights of each co-existing term :math:`t` in the query and passage, respectively. Multi-Vector ------------ The multi-vector method utilizes the entire output embeddings for the representation of query :math:`E_q` and passage :math:`E_p`. .. math:: E_q = norm(W_{mul}^T H_q) E_p = norm(W_{mul}^T H_p) where :math:`W_{mul}` is the learnable projection matrix. Following ColBERT, BGE-M3 uses late interaction to compute the fine-grained relevance score: .. math:: s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j] where :math:`E_q, E_p` are the entire output embeddings of query and passage, respectively. This averages, over the query vectors, the maximum similarity of each :math:`v\in E_q` with the vectors in :math:`E_p`. Hybrid Ranking -------------- BGE-M3's multi-functionality makes hybrid ranking possible to improve retrieval. First, due to the heavy cost of the multi-vector method, we can retrieve candidate results with either the dense or the sparse method. Then, to get the final result, we can rerank the candidates based on the integrated relevance score: .. math:: s_{rank} = w_1\cdot s_{dense}+w_2\cdot s_{lex} + w_3\cdot s_{mul} where the values chosen for :math:`w_1`, :math:`w_2` and :math:`w_3` vary depending on the downstream scenario. Usage ===== .. code:: python from FlagEmbedding import BGEM3FlagModel model = BGEM3FlagModel('BAAI/bge-m3') sentences_1 = ["What is BGE M3?", "Definition of BM25"] output = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True) dense, sparse, multiv = output['dense_vecs'], output['lexical_weights'], output['colbert_vecs'] Useful Links: `API <../API/inference/embedder/encoder_only/M3Embedder>`_ `Tutorial `_ `Example `_ --- BGE-Reranker ============ Different from an embedding model, a reranker (cross-encoder) takes a question and a document as input and directly outputs a similarity score instead of embeddings. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by simpler models. For example, one can use a bge embedding model to first retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents and obtain the final top 3 results, as sketched below.
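The following is a minimal sketch of this two-stage pipeline; the corpus, query, and cutoff values are illustrative only:

.. code:: python

    from FlagEmbedding import FlagModel, FlagReranker

    corpus = [
        "BGE-M3 supports dense, sparse and multi-vector retrieval.",
        "BM25 is a bag-of-words ranking function.",
        "A reranker jointly encodes a query and a document to score them.",
    ]
    query = "What is a reranker?"

    # Stage 1: Bi-Encoder retrieval, keep the top-2 candidates by inner product
    embedder = FlagModel('BAAI/bge-base-en-v1.5')
    scores = (embedder.encode([query]) @ embedder.encode(corpus).T)[0]
    top_ids = scores.argsort()[::-1][:2]

    # Stage 2: Cross-Encoder re-ranking of the retrieved candidates
    reranker = FlagReranker('BAAI/bge-reranker-base')
    rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_ids])
    print(sorted(zip(rerank_scores, top_ids.tolist()), reverse=True))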
The first series of BGE-Reranker contains two models, large and base. +-------------------------------------------------------------------------------+-----------------------+------------+--------------+-----------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +===============================================================================+=======================+============+==============+=======================================================================+ | `BAAI/bge-reranker-large `_ | English & Chinese | 560M | 2.24 GB | Larger reranker model, easy to deploy with better inference | +-------------------------------------------------------------------------------+-----------------------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-reranker-base `_ | English & Chinese | 278M | 1.11 GB | Lightweight reranker model, easy to deploy with fast inference | +-------------------------------------------------------------------------------+-----------------------+------------+--------------+-----------------------------------------------------------------------+ bge-reranker-large and bge-reranker-base used `XLM-RoBERTa-Large `_ and `XLM-RoBERTa-Base `_ respectively as the base model. They were trained on high-quality English and Chinese data, and achieved state-of-the-art performance among models of the same size at the time of release. Usage ----- .. code:: python from FlagEmbedding import FlagReranker reranker = FlagReranker( 'BAAI/bge-reranker-base', query_max_length=256, use_fp16=True, devices=['cuda:1'], ) score = reranker.compute_score(['I am happy to help', 'Assisting you is my pleasure']) print(score) --- BGE-Reranker-v2 =============== +------------------------------------------------------------------------------------------------------------------+-----------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +==================================================================================================================+=======================+=============+==============+=========================================================================================================================================================+ | `BAAI/bge-reranker-v2-m3 `_ | Multilingual | 568M | 2.27 GB | Lightweight reranker model, possesses strong multilingual capabilities, easy to deploy, with fast inference. | +------------------------------------------------------------------------------------------------------------------+-----------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ | `BAAI/bge-reranker-v2-gemma `_ | Multilingual | 2.51B | 10 GB | Suitable for multilingual contexts, performs well in both English proficiency and multilingual capabilities.
| +------------------------------------------------------------------------------------------------------------------+-----------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ | `BAAI/bge-reranker-v2-minicpm-layerwise `_ | Multilingual | 2.72B | 10.9 GB | Suitable for multilingual contexts, allows freedom to select layers for output, facilitating accelerated inference. | +------------------------------------------------------------------------------------------------------------------+-----------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ | `BAAI/bge-reranker-v2.5-gemma2-lightweight `_ | Multilingual | 2.72B | 10.9 GB | Suitable for multilingual contexts, allows freedom to select layers, compress ratio and compress layers for output, facilitating accelerated inference. | +------------------------------------------------------------------------------------------------------------------+-----------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ .. tip:: You can select the model according your senario and resource: - For multilingual, utilize :code:`BAAI/bge-reranker-v2-m3`, :code:`BAAI/bge-reranker-v2-gemma` and :code:`BAAI/bge-reranker-v2.5-gemma2-lightweight`. - For Chinese or English, utilize :code:`BAAI/bge-reranker-v2-m3` and :code:`BAAI/bge-reranker-v2-minicpm-layerwise`. - For efficiency, utilize :code:`BAAI/bge-reranker-v2-m3` and the low layer of :code:`BAAI/bge-reranker-v2-minicpm-layerwise`. - For better performance, recommand :code:`BAAI/bge-reranker-v2-minicpm-layerwise` and :code:`BAAI/bge-reranker-v2-gemma`. Make sure always test on your real use case and choose the one with best speed-quality balance! Usage ----- **bge-reranker-v2-m3** Use :code:`bge-reranker-v2-m3` in the same way as bge-reranker-base and bge-reranker-large. .. code:: python from FlagEmbedding import FlagReranker # Setting use_fp16 to True speeds up computation with a slight performance degradation reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True) score = reranker.compute_score(['query', 'passage']) # or set "normalize=True" to apply a sigmoid function to the score for 0-1 range score = reranker.compute_score(['query', 'passage'], normalize=True) print(score) **bge-reranker-v2-gemma** Use the :code:`FlagLLMReranker` class for bge-reranker-v2-gemma. .. code:: python from FlagEmbedding import FlagLLMReranker # Setting use_fp16 to True speeds up computation with a slight performance degradation reranker = FlagLLMReranker('BAAI/bge-reranker-v2-gemma', use_fp16=True) score = reranker.compute_score(['query', 'passage']) print(score) **bge-reranker-v2-minicpm-layerwise** Use the :code:`LayerWiseFlagLLMReranker` class for bge-reranker-v2-minicpm-layerwise. .. code:: python from FlagEmbedding import LayerWiseFlagLLMReranker # Setting use_fp16 to True speeds up computation with a slight performance degradation reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise', use_fp16=True) # Adjusting 'cutoff_layers' to pick which layers are used for computing the score. 
score = reranker.compute_score(['query', 'passage'], cutoff_layers=[28]) print(score) **bge-reranker-v2.5-gemma2-lightweight** Use the :code:`LightWeightFlagLLMReranker` class for bge-reranker-v2.5-gemma2-lightweight. .. code:: python from FlagEmbedding import LightWeightFlagLLMReranker # Setting use_fp16 to True speeds up computation with a slight performance degradation reranker = LightWeightFlagLLMReranker('BAAI/bge-reranker-v2.5-gemma2-lightweight', use_fp16=True) # Adjusting 'cutoff_layers' to pick which layers are used for computing the score. score = reranker.compute_score(['query', 'passage'], cutoff_layers=[28], compress_ratio=2, compress_layer=[24, 40]) print(score) --- BGE v1 & v1.5 ============= BGE v1 and v1.5 are series of encoder only models base on BERT. They achieved best performance among the models of the same size at the time of release. BGE --- The first group of BGE models was released in Aug 2023. The :code:`bge-large-en` and :code:`bge-large-zh` ranked 1st on MTEB and C-MTEB benchmarks at the time released. +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +===================================================================+===========+============+==============+=======================================================================+ | `BAAI/bge-large-en `_ | English | 335M | 1.34 GB | Embedding Model which map text into vector | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-base-en `_ | English | 109M | 438 MB | a base-scale model but with similar ability to `BAAI/bge-large-en` | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-small-en `_ | English | 33.4M | 133 MB | a small-scale model but with competitive performance | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-large-zh `_ | Chinese | 326M | 1.3 GB | Embedding Model which map text into vector | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-base-zh `_ | Chinese | 102M | 409 MB | a base-scale model but with similar ability to `BAAI/bge-large-zh` | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-small-zh `_ | Chinese | 24M | 95.8 MB | a small-scale model but with competitive performance | +-------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ BGE-v1.5 -------- Then to enhance its retrieval ability without instruction and alleviate the issue of the similarity distribution, :code:`bge-*-v1.5` models were released in Sep 2023. They are still the most popular embedding models that balanced well between embedding quality and model sizes. 
+-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+ | Model | Language | Parameters | Model Size | Description | +=============================================================================+===========+============+==============+==============+ | `BAAI/bge-large-en-v1.5 `_ | English | 335M | 1.34 GB | version 1.5 | +-----------------------------------------------------------------------------+-----------+------------+--------------+ with more + | `BAAI/bge-base-en-v1.5 `_ | English | 109M | 438 MB | reasonable | +-----------------------------------------------------------------------------+-----------+------------+--------------+ similarity + | `BAAI/bge-small-en-v1.5 `_ | English | 33.4M | 133 MB | distribution | +-----------------------------------------------------------------------------+-----------+------------+--------------+ and better + | `BAAI/bge-large-zh-v1.5 `_ | Chinese | 326M | 1.3 GB | performance | +-----------------------------------------------------------------------------+-----------+------------+--------------+ + | `BAAI/bge-base-zh-v1.5 `_ | Chinese | 102M | 409 MB | | +-----------------------------------------------------------------------------+-----------+------------+--------------+ + | `BAAI/bge-small-zh-v1.5 `_ | Chinese | 24M | 95.8 MB | | +-----------------------------------------------------------------------------+-----------+------------+--------------+--------------+ Usage ----- To use BGE v1 or v1.5 model for inference, load model through .. code:: python from FlagEmbedding import FlagModel model = FlagModel('BAAI/bge-base-en-v1.5') sentences = ["Hello world", "I am inevitable"] embeddings = model.encode(sentences) .. tip:: For simple tasks that only encode a few sentences like above, it's faster to use CPU or a single GPU instead of multi-GPUs To use CPU: .. code:: python # make no GPU visible import os os.environ['CUDA_VISIBLE_DEVICES'] = '' # or claim the devices during initialize the model model = FlagModel('BAAI/bge-base-en-v1.5', devices='cpu') To use a single GPU: .. 
code:: python # select one single card to be visible import os os.environ['CUDA_VISIBLE_DEVICES'] = '0' # or specify the devices when initializing the model model = FlagModel('BAAI/bge-base-en-v1.5', devices=0) | Useful Links: `API <../API/inference/embedder/encoder_only/BaseEmbedder>`_ `Tutorial `_ `Example `_ --- BGE-VL ====== BGE-VL is a series of multimodal retrieval models trained on `MegaPairs `_. BGE-VL contains lightweight CLIP-based models as well as more powerful LLaVA-NeXT-based MLLM models: +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | Model | Language | Parameters | Model Size | Description | +======================================================================+===========+============+==============+=======================================================================+ | `BAAI/bge-vl-base `_ | English | 150M | 299 MB | Lightweight multimodal embedder for image and text | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-vl-large `_ | English | 428M | 855 MB | Large-scale multimodal embedder for image and text | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-vl-MLLM-S1 `_ | English | 7.57B | 15.14 GB | SOTA in composed image retrieval, trained on MegaPairs dataset | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/bge-vl-MLLM-S2 `_ | English | 7.57B | 15.14 GB | BGE-VL-MLLM-S1 fine-tuned for one epoch on the MMEB training set | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/BGE-VL-v1.5-zs `_ | English | 7.57B | 15.14 GB | Better multi-modal retrieval model that performs well across a wide range of tasks | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ | `BAAI/BGE-VL-v1.5-mmeb `_ | English | 7.57B | 15.14 GB | Better multi-modal retrieval model, additionally fine-tuned on the MMEB training set | +----------------------------------------------------------------------+-----------+------------+--------------+-----------------------------------------------------------------------+ BGE-VL-CLIP ----------- The base and large models are trained based on CLIP-vit-base-patch16 and CLIP-vit-large-patch14. For composed image-text data, the model directly uses score fusion to sum the outputs of the visual encoder and the text encoder to get the final embedding. .. tip:: Our code works well on transformers==4.45.2, and we recommend using this version. You can easily use BGE-VL-CLIP models based on transformers: ..
code:: python import torch from transformers import AutoModel MODEL_NAME = "BAAI/BGE-VL-base" # or "BAAI/BGE-VL-large" model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True) # You must set trust_remote_code=True model.set_processor(MODEL_NAME) model.eval() with torch.no_grad(): query = model.encode( images = "./assets/cir_query.png", text = "Make the background dark, as if the camera has taken the photo at night" ) candidates = model.encode( images = ["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"] ) scores = query @ candidates.T print(scores) BGE-VL-MLLM ----------- The multimodal large language models (MLLMs) incorporate a visual encoder, typically based on a vision transformer, into a large language model (LLM). This integration allows image tokens to be directly processed by the LLM. Consequently, MLLMs can effectively handle diverse multimodal inputs by converting any type of input into a sequence of tokens. BGE-VL-MLLM builds upon the LLaVA1.6. In both training and inference stages, MMRet uses task-specific instructions for query inputs to improve generalization, aligning with standard practices in LLM-based embedding models. A typical multimodal query input is structured as follows: .. math:: ⟨\text{instruct}⟩{\{task\_ inst\}} \space⟨\text{query}⟩\{q_t\} \{q_i\}\space[\text{EOS}] where :math:`{task_inst}` represents the task-specific instruction, :math:`{qt}` denotes the input query text, and :math:`{qi}` is the input query image. The normalized last hidden state of the [EOS] token in the MLLM is used as the embedding of any given input sequence. .. code:: python import torch from transformers import AutoModel from PIL import Image MODEL_NAME= "BAAI/BGE-VL-MLLM-S1" model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True) model.eval() model.cuda() with torch.no_grad(): model.set_processor(MODEL_NAME) query_inputs = model.data_process( text="Make the background dark, as if the camera has taken the photo at night", images="./assets/cir_query.png", q_or_c="q", task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: " ) candidate_inputs = model.data_process( images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"], q_or_c="c", ) query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :] candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :] query_embs = torch.nn.functional.normalize(query_embs, dim=-1) candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1) scores = torch.matmul(query_embs, candi_embs.T) print(scores) BGE-VL-v1.5 ----------- BGE-VL-v1.5 series is the updated version of BGE-VL, bringing better performance on both retrieval and multi-modal understanding. The models were trained on 30M MegaPairs data and extra 10M natural and synthetic data. `bge-vl-v1.5-zs` is a zero-shot model, only trained on the data mentioned above. `bge-vl-v1.5-mmeb` is the fine-tuned version on MMEB training set. .. 
code:: python import torch from transformers import AutoModel from PIL import Image MODEL_NAME= "BAAI/BGE-VL-v1.5-mmeb" # "BAAI/BGE-VL-v1.5-zs" model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True) model.eval() model.cuda() with torch.no_grad(): model.set_processor(MODEL_NAME) query_inputs = model.data_process( text="Make the background dark, as if the camera has taken the photo at night", images="../../imgs/cir_query.png", q_or_c="q", task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: " ) candidate_inputs = model.data_process( images=["../../imgs/cir_candi_1.png", "../../imgs/cir_candi_2.png"], q_or_c="c", ) query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :] candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :] query_embs = torch.nn.functional.normalize(query_embs, dim=-1) candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1) scores = torch.matmul(query_embs, candi_embs.T) print(scores) For more details, check out the repo of `MegaPairs `_ --- BGE === .. figure:: ../_static/img/bge_logo.jpeg :width: 250 :align: center **BGE** stands for **BAAI General Embeddings**, which is a series of embedding models released by BAAI. .. toctree:: :maxdepth: 1 :caption: Embedder bge_v1_v1.5 bge_m3 bge_icl bge_vl bge_code .. toctree:: :maxdepth: 1 :caption: Reranker bge_reranker bge_reranker_v2 --- Community ========= Visit our `GitHub repo `_ and `Hugging Face collection `_ for more materials! We also host WeChat groups for BGE. Scan the QR code to join the group chat! To get first-hand news about our updates and new releases, or if you have any questions or ideas, join us now! .. figure:: ../_static/img/BGE_WeChat_Group.png :width: 400 :align: center --- .. FlagEmbedding documentation master file, created by sphinx-quickstart on Sat Oct 12 13:27:49 2024. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. :html_theme.sidebar_secondary.remove: True Welcome to BGE! =============== .. Welcome to BGE documentation! .. figure:: _static/img/bge_panda.jpg :width: 400 :align: center .. grid:: 3 :gutter: 3 .. grid-item-card:: :octicon:`milestone` Introduction New to BGE? Quickly get hands-on information. +++ .. button-ref:: Introduction/index :expand: :color: primary :click-parent: To Introduction .. grid-item-card:: :octicon:`package` BGE Models Get to know BGE embedding models and rerankers. +++ .. button-ref:: bge/index :expand: :color: primary :click-parent: To BGE .. grid-item-card:: :octicon:`log` Tutorials Find useful tutorials to start with if you are looking for guidance. +++ .. button-ref:: tutorial/index :expand: :color: primary :click-parent: To Tutorials .. grid-item-card:: :octicon:`codescan` API Check the API of classes and functions in FlagEmbedding. +++ .. button-ref:: API/index :expand: :color: primary :click-parent: To APIs .. grid-item-card:: :octicon:`question` FAQ Take a look at frequently asked questions. +++ .. button-ref:: FAQ/index :expand: :color: primary :click-parent: To FAQ .. grid-item-card:: :octicon:`people` Community Welcome to join the BGE community! +++ .. button-ref:: community/index :expand: :color: primary :click-parent: To Community Besides the resources we provide here in this documentation, please visit our `GitHub repo `_ for more contents including: - Want to get familiar with BGE quickly?
There are hands-on `examples `_ to run for embedder and reranker's inference, evaluation, and finetuning. - Unfamiliar with some area, keywords or techniques of retrieval and RAG? We provide `tutorials `_ to teach you basic knowledge and coding tips. - Interested in research topics that expanding from BGE and retrieval? Our `research `_ folder contains many exciting topics for you to explore. BGE is developed by Beijing Academy of Artificial Intelligence (BAAI). | .. image:: _static/img/BAAI_logo.png :target: https://github.com/FlagOpen/FlagEmbedding :width: 300 :align: center .. toctree:: :maxdepth: 1 :hidden: Home .. toctree:: :hidden: :maxdepth: 1 :caption: Introduction Introduction/index .. toctree:: :hidden: :maxdepth: 1 :caption: BGE bge/index .. toctree:: :hidden: :maxdepth: 2 :caption: Tutorials tutorial/index .. toctree:: :hidden: :maxdepth: 5 :caption: API API/index .. toctree:: :hidden: :maxdepth: 1 :caption: FAQ FAQ/index .. toctree:: :hidden: :maxdepth: 1 :caption: Community community/index --- 1. Embedding ============ .. toctree:: :hidden: :maxdepth: 1 :caption: Embedding 1_Embedding/1.1.1 1_Embedding/1.2.1 1_Embedding/1.2.2 1_Embedding/1.2.3 1_Embedding/1.2.4 1_Embedding/1.2.5 --- 2. Metrics ========== .. toctree:: :hidden: :maxdepth: 1 :caption: Metrics 2_Metrics/2.1 2_Metrics/2.2 --- 3. Indexing =========== .. toctree:: :hidden: :maxdepth: 1 :caption: Indexing 3_Indexing/3.1.1 3_Indexing/3.1.2 3_Indexing/3.1.3 3_Indexing/3.1.4 3_Indexing/3.1.5 --- 4. Evaluation ============= .. toctree:: :hidden: :maxdepth: 1 :caption: Evaluation 4_Evaluation/4.1.1 4_Evaluation/4.2.1 4_Evaluation/4.2.2 4_Evaluation/4.2.3 4_Evaluation/4.3.1 4_Evaluation/4.4.1 4_Evaluation/4.5.1 4_Evaluation/4.5.2 --- 5. Reranking ============ .. toctree:: :hidden: :maxdepth: 1 :caption: Reranking 5_Reranking/5.1 5_Reranking/5.2 5_Reranking/5.3 --- 6. RAG ====== .. toctree:: :hidden: :maxdepth: 1 :caption: RAG 6_RAG/6.1 6_RAG/6.2 6_RAG/6.3 --- 7. Finetuning ============= .. toctree:: :hidden: :maxdepth: 1 :caption: Finetuning 7_Finetuning/7.1.1 7_Finetuning/7.1.2 7_Finetuning/7.1.3 7_Finetuning/7.2.1 --- Tutorials ========= In this section, we provide hands on introduction to different topics that highly related to embedding models and retrieval. To run the tutorials, clone the GitHub repo and check the `Tutorials `_ folder. .. toctree:: :maxdepth: 1 :caption: Tutorials 1_Embedding 2_Metrics 3_Indexing 4_Evaluation 5_Reranking 6_RAG 7_Finetuning --- # Evaluation Make sure you have created the environment and downloaded the data according to [README](../README.md). 
```bash conda activate beacon model=namespace-Pt/beacon-qwen-2-7b-instruct # language modeling perplexity torchrun --nproc_per_node 8 -m main.eval_lm --max_length 100000 --stride 32768 --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 # passkey retrieval accuracy torchrun --nproc_per_node 8 -m main.eval_passkey --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 # needle-in-a-haystack accuracy OPENAI_API_KEY="" torchrun --nproc_per_node 8 -m main.eval_needle --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 --gpt_eval # topic retrieval accuracy torchrun --nproc_per_node 8 -m main.eval_topic --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 # longbench torchrun --nproc_per_node 8 -m main.eval_longbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 # infinitebench torchrun --nproc_per_node 8 -m main.eval_infbench --model_name_or_path $model --enable_beacon --beacon_ratio_mix adapt-1024 ``` All evaluation results will be saved at `data/results`. --- # Training There are two stages in training: - Pretrain - 1B token from [redpajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample) with auto-regressive language modeling - Add eos to each document and no packing - 20K context length at maximum - Finetune - 5K samples from [LongAlpaca](https://huggingface.co/datasets/Yukang/LongAlpaca-12k), 2K samples from [Booksum](https://huggingface.co/datasets/kmfoda/booksum), 16K synthetic long-context QA data from GPT-3.5, and 5K samples from pretraining data - 20K context length at maximum ## Prerequisite Make sure you have created the environment and downloaded the data according to [README](../README.md). ### Mistral #### Pretrain ```bash output_name=beacon-mistral-pretrain torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \ --train_data long-llm:redpajama/train.json \ --min_length 2400 \ --max_length 20000 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 2048 \ --beacon_stride 2048 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 0 \ --beacon_ratio 2 4 8 16 32 \ --beacon_ratio_mix step-random \ --beacon_param q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --save_strategy epoch \ --evaluation_strategy steps \ --num_train_epochs 1 \ --logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json ``` #### Finetune ```bash output_name=beacon-mistral-finetune torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path data/outputs/beacon-mistral-pretrain/* \ --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \ --max_length 20000 \ --min_length 7200 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 2048 \ --beacon_stride 2048 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 0 \ --beacon_ratio 2 4 8 \ --beacon_ratio_mix step-random \ --beacon_param q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --learning_rate 1e-5 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --num_train_epochs 1 \ --save_strategy epoch \ 
--logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json \ --chat_template mistral ``` ### Llama-3 NOTE: according to our experiment, Llama-3 requires attention sink. #### Pretrain ```bash output_name=beacon-llama3-pretrain torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \ --train_data long-llm:redpajama/train.json \ --min_length 2400 \ --max_length 20000 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 1024 \ --beacon_stride 1024 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 1 \ --beacon_ratio 2 4 8 16 32 \ --beacon_ratio_mix step-random \ --beacon_param q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --save_strategy epoch \ --evaluation_strategy steps \ --num_train_epochs 1 \ --logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json ``` #### Finetune ```bash output_name=beacon-llama3-finetune torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path data/outputs/beacon-llama3-pretrain/* \ --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \ --max_length 20000 \ --min_length 7200 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 1024 \ --beacon_stride 1024 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 1 \ --beacon_ratio 2 4 8 \ --beacon_ratio_mix step-random \ --beacon_param q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --learning_rate 1e-5 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --num_train_epochs 1 \ --save_strategy epoch \ --logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json \ --chat_template llama-3 ``` ### Qwen-2 #### Pretrain ```bash output_name=beacon-qwen2-pretrain torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path Qwen/Qwen2-7B-Instruct \ --train_data long-llm:redpajama/train.json \ --min_length 2400 \ --max_length 20000 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 2048 \ --beacon_stride 2048 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 0 \ --beacon_ratio 2 4 8 16 32 \ --beacon_ratio_mix step-random \ --beacon_param q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --save_strategy epoch \ --evaluation_strategy steps \ --num_train_epochs 1 \ --logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json ``` #### Finetune ```bash torchrun --nproc_per_node 8 $DDP -m main.train \ --output_dir data/outputs/$output_name \ --model_name_or_path data/outputs/beacon-qwen2-pretrain/* \ --train_data long-llm:gpt/one_detail_book.train.16K.json long-llm:gpt/one_detail_paper.train.16K.json long-llm:longalpaca/train.json long-llm:booksum/train.16K.json long-llm:needle/train.16K.json long-llm:redpajama/train.json[5000] \ --max_length 20000 \ --min_length 7200 \ --group_by_stride strict \ --enable_beacon \ --beacon_window 2048 \ --beacon_stride 2048 \ --beacon_attn full-coverage \ --beacon_attend_prev True \ --beacon_sink_size 0 \ --beacon_ratio 2 4 8 \ --beacon_ratio_mix step-random \ --beacon_param 
q k v \ --beacon_pos interleave \ --attn_impl flash_attention_2 \ --learning_rate 1e-5 \ --gradient_checkpointing \ --use_reentrant False \ --save_only_model \ --num_train_epochs 1 \ --save_strategy epoch \ --logging_steps 50 \ --bf16 \ --deepspeed data/deepspeed/stage2.json \ --chat_template qwen ``` --- # Evaluation LLM-Embedder supports 6 retrieval-augmentation tasks tailored for modern LLMs, including: - Question Answering (qa) - evaluate with `eval_popqa` and `eval_mmlu` - In-Context Learning (icl) - evaluate with `eval_icl` - Long Conversation (chat) - evaluate with `eval_msc` - Long-Range Language Modeling (lrlm) - evaluate with `eval_lrlm` - Tool Learning (tool) - evaluate with `eval_tool` - Conversational Search (convsearch) - evaluate with `eval_qrecc` ## Environment It is recommended that you create a new environment: ``` cd FlagEmbedding/llm_embedder conda env create -f environment.yaml --name llm-embedder conda activate llm-embedder ``` To use BM25, you must download **java11** and **anserini**, then add java to your `PATH`: ```bash # feel free to change /data to your preferred location wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz cd /data tar -xzvf java11.tar.gz tar -xzvf anserini.tar.gz # the lines below only set JAVA_HOME temporarily; it is RECOMMENDED that you add them to ~/.bashrc export JAVA_HOME=/data/jdk-11.0.2 export PATH=$JAVA_HOME/bin:$PATH ``` ## Data You should download the data for fine-tuning & evaluation and untar the file anywhere you prefer, e.g. `/data`, which results in a folder `/data/llm-embedder`: ```bash # feel free to change /data to your preferred location wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz cd /data tar -xzvf llm-embedder.tar.gz ``` The corpus of QReCC for conversational search is too large (54M passages), so we upload it separately to the huggingface dataset [namespace-Pt/qrecc-corpus](https://huggingface.co/datasets/namespace-Pt/qrecc-corpus). To evaluate the performance on conversational search, you should load it and save it as a json file in the `qrecc` folder: ```python import datasets # load dataset qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train") # save to jsonline format in YOUR data folder qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records") ``` ## Benchmark ### Commands Below are commands to run evaluation for different retrieval models. You can replace `eval_popqa` with any of `eval_mmlu`, `eval_icl`, `eval_lrlm`, `eval_msc`, `eval_tool`, and `eval_qrecc`. The results will be logged at `data/results/`. *All our evaluations are based on `meta-llama/Llama-2-7b-chat-hf`. To use different language models, e.g.
`Qwen/Qwen-7B-Chat`, simply add `--model_name_or_path Qwen/Qwen-7B-Chat` after every command.* *Note that you can modify the default value of `data_root` in `src/retrieval/args.py`, so that you don't need to type it for each command.* ```bash cd FlagEmbedding/llm_embedder # No retrieval torchrun --nproc_per_node 8 -m evaluation.eval_popqa --retrieval_method no --data_root /data/llm-embedder # Random torchrun --nproc_per_node 8 -m evaluation.eval_popqa --retrieval_method random --data_root /data/llm-embedder # BM25 (anserini_dir is the folder where you untar anserini.tar.gz) torchrun --nproc_per_node 8 -m evaluation.eval_popqa --retrieval_method bm25 --data_root /data/llm-embedder --anserini_dir /data/anserini # Contriever torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder facebook/Contriever --dense_metric ip --add_instruction False --data_root /data/llm-embedder # BGE torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder BAAI/bge-base-en --version bge --data_root /data/llm-embedder # AAR (uses special decoder pooling) torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder OpenMatch/AAR-ANCE --pooling_method decoder --add_instruction False --data_root /data/llm-embedder # APIRetriever torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder ToolBench/ToolBench_IR_bert_based_uncased --pooling_method mean --dense_metric ip --add_instruction False --data_root /data/llm-embedder # LLMRetriever torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder intfloat/llm-retriever-base --add_instruction false --pooling_method mean --data_root /data/llm-embedder # RetroMAE_BEIR torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder Shitao/RetroMAE_BEIR --dense_metric ip --add_instruction False --data_root /data/llm-embedder # LLM Embedder torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder BAAI/llm-embedder --version llm-embedder --data_root /data/llm-embedder ``` For Instructor, we should first convert it to our format: ```python # convert sentence transformer based Instructor to our format import torch from dataclasses import asdict from src.retrieval import DenseRetriever, RetrievalArgs from sentence_transformers import SentenceTransformer model_args = RetrievalArgs( query_encoder = "hkunlp/instructor-base", pooling_method = ["mean", "dense"], dtype = "fp32" ) retriever = DenseRetriever(**asdict(model_args), cache_dir=model_args.model_cache_dir) tokenizer = retriever.tokenizer with torch.no_grad(): sent_model = SentenceTransformer(model_args.query_encoder, device="cpu") retriever.dense_pooler.weight.data = sent_model.state_dict()["2.linear.weight"] x = sent_model.encode(["I love you"]) y = retriever.encode("I love you") print(torch.isclose(torch.from_numpy(x), y)) retriever.save_pretrained("data/outputs/instructor-base") ``` Then we evaluate with ```bash torchrun --nproc_per_node 8 -m evaluation.eval_popqa --query_encoder data/outputs/instructor-base/encoder --pooling_method mean dense --version instructor --data_root /data/llm-embedder ``` ### Leaderboard All the following results are based on `meta-llama/Llama-2-7b-chat-hf` with `torch==2.0.1`, `transformers==4.30.0` on an `8xA100` machine with `CUDA==11.4`.
|Model|MMLU (avg)|PopQA (acc)|In-Context Learning (avg)|Long Conversation (ppl)|Long-Range Language Modeling (ppl)|Tool Learning (ndcg)|Conversational Search (ndcg)| |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:| |None|0.4599|0.2061|0.4645|19.3501|6.4003|--|--| |BM25|0.4721|0.3491|0.484|14.6512|6.1558|0.5115|0.4341| |Instructor|0.4721|0.3533|0.6036|14.8799|6.1733|0.3882|0.2863| |Contriever|0.4684|0.3276|0.6009|14.2129|6.1305|0.4904|0.3563| |BGE|0.4896|0.4491|0.5974|14.2943|6.1335|0.5761|0.3856| |AAR|0.4826|0.4792|0.5938|14.6999|6.1528|0.42|0.2877| |LLMRetriever|0.4625|0.2506|0.6262|14.4746|6.1750|0.1321|0.0234| |APIRetriever|0.4625|0.2488|0.5945|14.7834|6.1833|0.8017|0.1137| |LLM-Embedder (ours)|**0.4903**|**0.5052**|**0.6288**|**13.4832**|**6.0972**|**0.8645**|**0.5053**| --- # Fine-tuning ## Environment It is recommended that you create a new environment: ``` cd FlagEmbedding/llm_embedder conda env create -f environment.yaml --name llm-embedder conda activate llm-embedder ``` To use BM25, you must download **java11** and **anserini**, then add java to your `PATH`: ```bash # feel free to alternate /data to your prefered location wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz cd /data tar -xzvf java11.tar.gz tar -xzvf anserini.tar.gz # below just temporarily set JAVA_HOME; it is RECOMMENDED that you store the lines the setting in ~/.bashrc export JAVA_HOME=/data/jdk-11.0.2 export PATH=$JAVA_HOME/bin:$PATH ``` ## Data You should download the data for fine-tuning & evaluation then untar the file at anywhere you prefer, e.g. `/data`, which results in a folder `/data/llm-embedder`: ```bash # feel free to alternate /data to your prefered location wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz cd /data tar -xzvf llm-embedder-eval.tar.gz ``` The corpus of QReCC for conversational search is too large (54M passages), we separately upload it to huggingface datasets [namespace-Pt/qrecc-corpus](https://huggingface.co/datasets/namespace-Pt/qrecc-corpus). To evaluate the performance on conversational search, you should load it and save it as json file in the `qrecc` folder: ```python import datasets # load dataset qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train") # save to jsonline format in YOUR data folder qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records") ``` The data formats for training and evaluation are as follows: ```python # training { "query": str, "pos": List[str], "neg": List[str], "pos_index": Optional[List[int]], # Indices of the positives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field. "neg_index": Optional[List[int]], # Indices of the negatives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field. "teacher_scores": Optional[List[float]], # Scores from an LM or a reranker, used for distillation. "answers": Optional[List[str]], # List of answers for the query, used for LM scoring. } # evaluation { "query": str, "pos_index": Optional[List[int]], # Indices of the positives w.r.t. corpus. When there is no positives pre-defined (e.g. NQ), just ignore this field. 
"answers": Optional[List[str]], # List of answers for computing NQ metrics. "key": Optional[List[str]], # Retrieval results of the query. Usually used for RAG or reranking. "key_index": Optional[List[int]], # Key indices w.r.t. the corpus. } ``` ## Retriever Below are several important arguments for training. The meaning and usage of other arguments can be inspected from [code](../src/retrieval/args.py) or running `python run_dense.py --help` from command line. - `train_data`: required, one or a list of json files with the aforementioned formatting. - `eval_data`: optional, one json file with the aforementioned formatting. If an `eval_data` is speficied, the trainer will automatically do evaluation on the `eval_data`. - `corpus`: optional, the global corpus where `positives`. **IMPORTANT NOTE** - For any path specified for `train_data`, `eval_data`, and `corpus`: if it is prefixed with `llm-embedder`, it will be solved to the relative path against [`data_root`](../src/retrieval/args.py). *Note that you can modify the default value of `data_root`, so that you don't need to type it for each command.* - During fine-tuning, we save the output model in the `huggingface transformers`🤗 format. To use it from `sentence_transformers`, you should convert it to `sentence_transformers` checkpoint in advance: ```bash python scripts/ours2st.py --encoder data/outputs/your-output-dir/encoder ``` Then everything is the same as described in [README](../README.md). ### LLM-Embedder (Multi-Task Fine-Tune) ```bash # Remember to modify the data_root to your data root in the script :) bash scripts/llm-embedder.sh ``` ### Single Task Fine-Tune Below we provide commands to fine-tune a retriever on a single task. #### QA ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/nq \ --train_data llm-embedder:qa/nq/train.json \ --eval_data llm-embedder:qa/nq/test.json \ --corpus llm-embedder:qa/nq/corpus.json \ --metrics nq \ --key_max_length 128 \ --query_max_length 32 \ --contrastive_weight 0 \ --stable_distill \ --eval_steps 2000 \ --save_steps 2000 \ --max_steps 2000 \ --data_root /data/llm-embedder ``` #### In-Context Learning ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/icl \ --train_data llm-embedder:icl/icl/train.json \ --select_positive random \ --contrastive_weight 0 \ --stable_distill \ --save_steps 6000 \ --max_steps 6000 \ --data_root /data/llm-embedder ``` #### Long-Range Language Modeling ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/lrlm \ --train_data llm-embedder:lrlm/books3/train.json llm-embedder:lrlm/arxiv/train.json llm-embedder:lrlm/codeparrot/train.json \ --select_positive teacher \ --teacher_scores_margin 0.1 \ --contrastive_weight 0 \ --teacher_temperature 0.1 \ --save_steps 4000 \ --max_steps 4000 \ --data_root /data/llm-embedder ``` #### Long Chat ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/msc \ --train_data llm-embedder:chat/msc/train.json \ --select_positive teacher \ --select_negative random \ --contrastive_weight 0 \ --teacher_temperature 0.1 \ --save_steps 4000 \ --max_steps 4000 \ --data_root /data/llm-embedder ``` #### Tool ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/tool \ --train_data llm-embedder:tool/toolbench/train.json \ --eval_data llm-embedder:tool/toolbench/test.json \ --corpus llm-embedder:tool/toolbench/corpus.json \ --key_template {text} \ --metrics ndcg \ --eval_steps 2000 \ --save_steps 2000 \ --max_steps 2000 \ 
--data_root /data/llm-embedder ``` #### Conversation Search ```bash torchrun --nproc_per_node=8 run_dense.py \ --output_dir data/outputs/qrecc \ --train_data llm-embedder:conversation/qrecc/train.concat.json \ --eval_data llm-embedder:conversation/qrecc/test.concat.json \ --corpus llm-embedder:conversation/qrecc/corpus.json \ --key_template '{text}' \ --metrics mrr ndcg \ --cutoffs 3 10 100 \ --eval_steps 2000 \ --save_steps 2000 \ --max_steps 2000 \ --data_root /data/llm-embedder ``` ### Mine Negatives ```bash # BGE (the result will be saved at llm-embedder:qa/nq/train.neg.bge.json) torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \ --eval_data llm-embedder:qa/nq/train.json \ --corpus llm-embedder:qa/nq/corpus.json \ --metrics mrr recall collate_neg \ --save_name bge \ --data_root /data/llm-embedder # BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untar anserini.tar.gz) torchrun --nproc_per_node 8 -m evaluation.eval_retrieval \ --anserini_dir /data/anserini \ --retrieval_method bm25 \ --eval_data llm-embedder:qa/nq/train.json \ --corpus llm-embedder:qa/nq/corpus.json \ --metrics mrr recall collate_neg \ --save_name bm25 \ --data_root /data/llm-embedder ``` ## LM Scoring Score positives and negatives in `eval_data` with $p(o|q,k)$, where $o$ is the desired output (i.e. the `answers` field), $q$ is the query, and $k$ is a key (which could be a positive or a negative). ```bash torchrun --nproc_per_node=8 run_lm_score.py \ --eval_data llm-embedder:qa/msmarco/train.json \ --data_root /data/llm-embedder \ --model_name_or_path meta-llama/Llama-2-7b-chat-hf \ --save_name llama2-7b-chat ``` Results will be saved at `/data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json`. ## Known Issues - `transformers==4.30.0` raises an error when using the DeepSpeed scheduler config - modify line `1750` in `trainer.py`: ```python if use_accelerator_prepare: # NOTE: fix bug in transformers 4.30.0 # model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer) self.model.train() if hasattr(self.lr_scheduler, "step"): if self.use_apex: model = self.accelerator.prepare(self.model) else: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer) else: # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config. model, self.optimizer, self.lr_scheduler = self.accelerator.prepare( self.model, self.optimizer, self.lr_scheduler ) ``` --- # Q&A Example A vector database can help LLMs access external knowledge. You can load baai-general-embedding as the encoder to generate the vectors. Here is an example of building a bot that can answer your questions using knowledge from Chinese Wikipedia. Here's a description of the Q&A dialogue scenario using flag embedding and a large language model: 1. **Data Preprocessing and Indexing:** - Download a Chinese Wikipedia dataset. - Encode the Chinese Wikipedia text using flag embedding. - Build an index using BM25. 2. **Query Enhancement with Large Language Model (LLM):** - Utilize a Large Language Model (LLM) to enhance and enrich the original user query based on the chat history. - The LLM can perform tasks such as text completion and paraphrasing to make the query more robust and comprehensive. 3. **Document Retrieval:** - Employ BM25 to retrieve the top-n documents from the locally stored Chinese wiki dataset based on the newly enhanced query. 4.
**Embedding Retrieval:** - Perform an embedding retrieval on the top-n retrieved documents using brute force search to get top-k documents. 5. **Answer Retrieval with Language Model (LLM):** - Present the question, the top-k retrieved documents, and chat history to the Large Language Model (LLM). - The LLM can utilize its understanding of language and context to provide accurate and comprehensive answers to the user's question. By following these steps, the Q&A system can leverage flag embedding, BM25 indexing, and a Large Language Model to improve the accuracy and intelligence of the system. The integration of these techniques can create a more sophisticated and reliable Q&A system for users, providing them with comprehensive information to effectively answer their questions. ### Installation ```shell sudo apt install default-jdk pip install -r requirements.txt conda install -c anaconda openjdk ``` ### Prepare Data ```shell python pre_process.py --data_path ./data ``` This script will download the dataset (Chinese wikipedia), building BM25 index, inference embedding, and then save them to `data_path`. ## Q&A usage ### Run Directly ```shell export OPENAI_API_KEY=... python run.py --data_path ./data ``` This script will build a Q&A dialogue scenario. ### Quick Start ```python # encoding=gbk from tool import LocalDatasetLoader, BMVectorIndex, Agent loader = LocalDatasetLoader(data_path="./data/dataset", embedding_path="./data/emb/data.npy") index = BMVectorIndex(model_path="BAAI/bge-large-zh", bm_index_path="./data/index", data_loader=loader) agent = Agent(index) question = "上次有人登月是什么时候" agent.Answer(question, RANKING=1000, TOP_N=5, verbose=False) ```
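As an illustration of the embedding-retrieval step (step 4) described above, a rough, self-contained sketch of re-ranking BM25 candidates with a BGE embedder is shown below. The candidate texts are placeholders, and this is not the repo's actual implementation (that logic lives in the `tool` module used in the Quick Start):

```python
import numpy as np
from FlagEmbedding import FlagModel

# Placeholder candidates, standing in for the top-n documents returned by BM25 in step 3
candidates = [
    "阿波罗17号是迄今最后一次载人登月任务，于1972年12月完成。",
    "长城是中国古代的军事防御工程。",
]
query = "上次有人登月是什么时候"

model = FlagModel("BAAI/bge-large-zh")
query_emb = model.encode([query])
cand_embs = model.encode(candidates)

# Brute-force inner-product search over the candidates, keep the top-k
scores = (query_emb @ cand_embs.T)[0]
top_k = np.argsort(scores)[::-1][:1]
print([candidates[i] for i in top_k])
```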