
Bibliographic references

Curated selection of papers, frameworks, and guides cited in the Manual QA AI v13 (Appendix B). Canonical format: author (year). title. venue. arXiv:ID or URL.

Verified URLs

URLs were verified as of the editing date (2026-04). The ecosystem changes quickly; check that each URL still resolves before citing it.

Standards and guides

  • OWASP. (2025). OWASP Top 10 for LLM Applications. genai.owasp.org/llm-top-10
  • NIST. (2023). AI Risk Management Framework (AI RMF 1.0). nist.gov/itl/ai-risk-management-framework
  • EU. (2024). Artificial Intelligence Act. Regulation (EU) 2024/1689. digital-strategy.ec.europa.eu/policies/ai-act
  • ISO/IEC. (2023). ISO/IEC 42001:2023 — AI Management Systems.
  • ISO/IEC. (2022). ISO/IEC 23053:2022 — Framework for AI Systems Using ML.
  • ISO/IEC. (2023). ISO/IEC 23894:2023 — Guidance on Risk Management for AI.
  • ISTQB. (2019). Certified Tester AI Testing (CT-AI): Syllabus v1.0. istqb.org. Original edition; check the current version.
  • ISTQB. (2025). Certified Tester Generative AI Testing (CT-GenAI): Syllabus v1.0. July 2025; Spanish translation December 2025.

AI fundamentals and LLM architecture

  • Russell, S., Norvig, P. (2021). Artificial Intelligence: A Modern Approach. 4th ed. Pearson.
  • Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762.

RAG frameworks and metrics

  • Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
  • Saad-Falcon, J. et al. (2023). ARES: An Automated Evaluation Framework for RAG. arXiv:2311.09476.
  • Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
  • Gao, L. et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496.
  • Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv:2310.11511.
  • Pradeep, R. et al. (2021). The Expando-Mono-Duo Design Pattern for Text Ranking. arXiv:2101.05667.

LLM-as-Judge and evaluation

  • Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS. arXiv:2306.05685.
  • Liu, Y. et al. (2023). G-Eval: NLG Evaluation using GPT-4. arXiv:2303.16634.
  • Chen, G. H. et al. (2024). Humans or LLMs as the Judge? A Study on Judgement Biases. arXiv:2402.10669.

Security and prompt injection

  • Greshake, K., Abdelnabi, S., Mishra, S. et al. (2023). Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173.
  • Perez, F., Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527.
  • Wei, A. et al. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS. arXiv:2307.02483.

Bias, fairness, and safety

  • Suresh, H., Guttag, J. (2021). A Framework for Understanding Sources of Harm Throughout the ML Life Cycle. EAAMO. arXiv:1901.10002.
  • Gallegos, I. O. et al. (2024). Bias and Fairness in Large Language Models: A Survey. Computational Linguistics. arXiv:2309.00770.
  • Bender, E. et al. (2021). On the Dangers of Stochastic Parrots. FAccT 2021.
  • Mitchell, M. et al. (2019). Model Cards for Model Reporting. FAT*.
  • Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM. arXiv:1803.09010.

Hallucinations and reasoning errors

  • Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
  • Manakul, P. et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.
  • Huang, L. et al. (2023). A Survey on Hallucination in LLMs. arXiv:2311.05232.
  • Mirzadeh, I. et al. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229.

Robustness

  • Ribeiro, M. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL.
  • Zhu, K. et al. (2023). PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. arXiv:2306.04528.
  • Morris, J. et al. (2020). TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. EMNLP.

Privacy and PII

  • Carlini, N. et al. (2021). Extracting Training Data from Large Language Models. USENIX Security.
  • Yao, Y. et al. (2024). LLM-PBE: Assessing Data Privacy in Large Language Models. arXiv:2408.12787.
  • Microsoft. Presidio — PII detection and anonymization. github.com/microsoft/presidio.

Inter-annotator agreement

  • Landis, J., Koch, G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics 33(1).
  • Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. 4th ed. SAGE.
  • Hayes, A., Krippendorff, K. (2007). Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures 1(1).

Energy consumption and environmental impact

  • Luccioni, A. S., Jernite, Y., Strubell, E. (2024a). Power Hungry Processing: Watts Driving the Cost of AI Deployment? FAccT 2024. arXiv:2311.16863.
  • Berthelot, A. et al. (2024). Estimating the environmental impact of Generative-AI services using an LCA-based methodology. arXiv:2401.14878.

Frameworks and tools

Tool                        Repository
RAGAS                       github.com/explodinggradients/ragas
TruLens                     github.com/truera/trulens
DeepEval                    github.com/confident-ai/deepeval
Langfuse                    github.com/langfuse/langfuse
Arize Phoenix               github.com/Arize-ai/phoenix
Argilla                     github.com/argilla-io/argilla
ranx                        github.com/AmenRa/ranx
PromptBench                 github.com/microsoft/promptbench
TextAttack                  github.com/QData/TextAttack
NL-Augmenter                github.com/GEM-benchmark/NL-Augmenter
LlamaGuard (Meta)           huggingface.co/meta-llama/LlamaGuard-7b
NeMo Guardrails (NVIDIA)    github.com/NVIDIA/NeMo-Guardrails
jsonschema (Python)         github.com/python-jsonschema/jsonschema

How to cite this manual

If you use the Manual QA AI v13 or this lab in a paper, talk, or course:

Moreno Cominero, G. (2026). Manual de QA para Sistemas de Inteligencia Artificial.
Version 13. 2026 edition. ai-testing-lab.vercel.app.

How to verify the references

URLs and arXiv IDs are verified as of the editing date. If you find a broken link or a withdrawn paper, open an issue at github.com/gonzaloMorenoc/ai-testing-lab/issues with the affected ID.
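
As a first pass before opening an issue, the arXiv IDs in an entry can be checked for shape and turned into links to open by hand. The following is a minimal sketch, not part of the manual's tooling: the function name `arxiv_links` is hypothetical, and it assumes every entry uses the `arXiv:ID` form from the canonical format with new-style identifiers (YYMM.NNNN or YYMM.NNNNN). It only validates the ID's shape; it does not confirm that the paper exists or has not been withdrawn.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN,
# optionally followed by a version suffix such as v2.
ARXIV_ID = re.compile(r"arXiv:(\d{2})(\d{2})\.(\d{4,5})(v\d+)?(?!\d)")


def arxiv_links(citation: str) -> list[str]:
    """Return abstract-page URLs for every well-formed arXiv ID in a citation line."""
    links = []
    for match in ARXIV_ID.finditer(citation):
        yy, mm, number, _version = match.groups()
        if not 1 <= int(mm) <= 12:
            continue  # month out of range, so the ID is malformed
        links.append(f"https://arxiv.org/abs/{yy}{mm}.{number}")
    return links


entry = "Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762."
print(arxiv_links(entry))
```

An entry that yields an empty list either has no arXiv ID (as with the ISO/IEC standards) or has a malformed one worth reporting.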

MIT License · GitHub