This is an article about data collection, an important part of LLM data engineering and a hot topic in the field of web crawling.
Jessy Tsui
April 12, 2024
More than a year after the advent of ChatGPT, people have gradually understood the operational logic of large models and are actively exploring their application scenarios. We firmly believe that the field of large models is data-centric, not model-centric. Therefore, in the exploration of the era of large models, we have accumulated a wealth of experience and insights about data, which we have compiled into the series "From Data to AGI: Unlocking the Secrets of Large Language Models." Based on these experiences, we have also developed the product DataTager, which will be launched soon.
The role of data in large models is indisputable. How to collect data sensibly, and which data to collect, are therefore important questions. Below, we explore the major data collection methods in detail, analyzing their advantages, disadvantages, and practical applications.
When we find suitable data on a website, the first thought is to download all of it to our local system for model training, hence the use of web crawlers.
Scrapy: A powerful Python web crawling framework suitable for large-scale crawling projects. A minimal example:

```bash
pip install scrapy
cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}
        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
EOF
scrapy runspider myspider.py
```
Beautiful Soup: A Python library for parsing HTML and XML documents, suitable for small-scale data scraping.
Selenium: A tool for automating web browsing, capable of handling dynamically loaded web page content.
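Because Selenium drives a real browser, it can wait for JavaScript-rendered elements before extracting them. Below is a minimal sketch, assuming Chrome is installed and using a hypothetical URL and CSS class purely for illustration:

```python
# Minimal Selenium sketch for dynamically loaded pages.
# Assumes Chrome is available; the URL and the CSS class are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/news")  # hypothetical page that loads articles via JavaScript
    # Wait until the JavaScript-rendered elements actually appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".article-title"))
    )
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".article-title")]
    print(titles)
finally:
    driver.quit()
```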
Category | Scrapy | Beautiful Soup | Selenium | GPT-Crawler | Scrapegraph-AI | MarkdownDown | Jina Reader |
---|---|---|---|---|---|---|---|
Concepts and Principles | A powerful Python web crawling framework suitable for large-scale crawling projects. | A Python library for parsing HTML and XML documents, suitable for small-scale data scraping. | A tool for automating web browsing, capable of handling dynamically loaded web page content. | A crawler tool combining GPT-3 capabilities, capable of understanding and processing complex web structures. | Utilizes graph neural networks for data extraction, suitable for complex data integration. | Specializes in generating structured Markdown files from web content, facilitating data organization and usage. | Uses AI technology to extract and summarize key information from web pages, enhancing the efficiency and accuracy of data collection. |
Advantages | Modular design, supports multi-threading, strong scraping and processing capabilities. | Easy to learn and use, quickly parses and handles HTML content. | Supports JavaScript rendering, can simulate user actions. | Strong natural language processing capabilities, high automation, and context understanding. | Handles complex relational data, efficiently integrates data from multiple sources. | Generates structured documents, easy to edit and share. | Automated information extraction, intelligent summarization. |
Disadvantages | Requires writing extensive code, not beginner-friendly. | Low efficiency in processing large-scale data, does not support asynchronous operations. | Slow and resource-intensive. | May require more computational resources and preliminary configuration. | Technically complex, high initial learning cost. | Only suitable for extracting data in specific formats. | Depends on the accuracy and performance of AI models. |
Use Cases | Large-scale data scraping, e.g., e-commerce, news portals. | Websites with simple structures and smaller data volumes. | Dynamic content websites, e.g., pages with real-time data updates. | Websites with complex structures and those requiring deep understanding. | Data extraction and integration from multiple sources. | Websites that need to convert web content into readable documents. | Websites that need to quickly gather and summarize information. |
Actual Examples | Using Scrapy to scrape Amazon product data to analyze market trends. | Using Beautiful Soup to scrape blog post content for text analysis. | Using Selenium to scrape dynamically loaded news websites to obtain the latest news articles. | Using GPT-Crawler to scrape technology blogs, automatically categorizing and summarizing content. | Using Scrapegraph-AI to scrape social network data, analyzing user relationships and interactions. | Using MarkdownDown to scrape technical documentation websites, converting content into Markdown files for internal use. | Using Jina Reader to scrape financial news websites, extracting and summarizing market dynamics. |
Category | Traditional Crawling Tools | New AI Crawling Tools |
---|---|---|
Understanding | Dependent on predefined rules and structures, handling complex web pages may be difficult. Advantage: Efficiently handles simple web pages. Disadvantage: Limited ability to handle complex or dynamic content, requires manual configuration. | Utilizes NLP and graph neural networks to better understand complex web structures. Advantage: Strong understanding, can automatically adjust scraping strategies. Disadvantage: May require more computational resources and preliminary configuration. |
Flexibility | Requires manual coding to handle different types of web pages, low flexibility. Advantage: High execution efficiency after specific task optimization. Disadvantage: Difficult to adapt to new types of websites or structural changes. | High adaptability, can automatically adjust scraping strategies based on web content. Advantage: Flexibly handles different websites and content structures. Disadvantage: Initial setup and training time can be lengthy. |
Efficiency | High efficiency in processing large-scale data, but requires extensive upfront configuration. Advantage: Efficiently handles data with known structures. Disadvantage: Large amount of upfront configuration work, difficult to cope with structural changes. | Intelligent analysis and automated processes enhance data scraping efficiency and accuracy. Advantage: High degree of automation, reduces manual intervention. Disadvantage: Requires more resources during operation. |
Usability | Requires writing and maintaining extensive code, steep learning curve. Advantage: Rich technical documentation and community support. Disadvantage: Not beginner-friendly, requires upfront configuration and ongoing maintenance. | Provides user-friendly interfaces and automated features, lowering the barrier to entry. Advantage: Easy to start, reduces coding requirements. Disadvantage: Needs to understand complex AI models and configurations. |
Cost | Open-source tools are usually free, but require extensive development and maintenance resources. Advantage: Low usage costs. Disadvantage: High hidden costs (such as development time and maintenance). | May require payment for use or subscription services, especially for commercial solutions. Advantage: Lowers development and maintenance costs. Disadvantage: High initial investment. |
Suitability | Suitable for websites with known structures and rules, especially static web pages. Advantage: Efficiently scrapes stable-structure websites. Disadvantage: Poor adaptability to dynamic or frequently changing websites. | Suitable for complex structures and dynamic content websites, automatically adapts and adjusts scraping strategies. Advantage: Applicable to various types of websites. Disadvantage: May be overly complex for simple structure websites. |
Legal and Ethical | Manual compliance with data privacy laws and website scraping rules is necessary. Advantage: Clear legal boundaries. Disadvantage: Requires extensive manual checking and adjustment. | Built-in compliance checks and privacy protection features, automatically adhering to legal and ethical standards. Advantage: Reduces legal risks and ethical issues. Disadvantage: Dependent on tool compliance. |
When conducting data scraping, data privacy is a critical factor to consider. With increasing global attention on data privacy, various countries have introduced related laws and regulations, such as the General Data Protection Regulation (GDPR) in the EU and the California Consumer Privacy Act (CCPA) in the USA. These regulations aim to protect user privacy by preventing unauthorized collection and use of personal data.
robots.txt is a text file used to tell search engines and other crawlers which pages can be scraped and which cannot. Website administrators define crawler access rules by placing a robots.txt file in the root directory of their websites.
The website https://darkvisitors.com/ lists the User Agents used by major AI companies' crawlers and tells you how to block these crawlers in your robots.txt file.
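For illustration, a robots.txt that blocks a couple of well-known AI crawlers might look like the sketch below; the exact User Agent names to block should be taken from a list such as the one above.

```text
# Illustrative robots.txt; verify current AI crawler User Agents on darkvisitors.com
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl, except private paths
User-agent: *
Disallow: /private/
```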
Five years ago, when I was at Sina and Sohu, we had a dedicated department running a massive distributed crawler system, which covered virtually every issue a crawler might encounter. I therefore have extensive experience in this field, which I share here in the hope of inspiring others.
The core of a crawler is to simulate all human actions and then automate them.
(The code for this section can be viewed here: https://github.com/PandaVT/DataTager/blob/main/blog_code/ai_crawler.py)
There's a principle in code engineering: abstraction.
Why talk about abstraction?
Building a universal AI crawler seems like a grand and technically challenging task, which can feel overwhelming. So, we can start by setting aside the perceived difficulty of the concept and consider the following problem-solving logic:
After some thought, it's clear:
```
input:  target website URL, content to scrape
output: a file containing the scraped content
```
This preliminary definition helps us consider:
It sounds challenging because there isn't a straightforward method to directly convert this input into output, and there are several other complex steps in between. The next question is:
From here, we can outline a clear logic chain:
```
step 1: fetch_html.py
    input:  target website URL
    output: HTML content of the target website

step 2: parse_content.py
    input:  HTML content of the target website + content to scrape
    output: specific parsed data (such as text, images, etc.)

step 3: process_data.py
    input:  specific parsed data (such as text, images, etc.)
    output: further-processed data (text cleaning, format standardization, etc.)

step 4: save_data.py
    input:  processed data
    output: saved file
```
Taking https://datatager.com/blog as an example, how were steps 1-4 executed?
First, we need a function that can fetch HTML content from a given URL.
```python
import requests

def fetch_html(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
```
Next, parse the HTML content to find all blog article titles and links.
```python
from bs4 import BeautifulSoup

def parse_content(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Assume titles are within <h2 class="entry-title"> tags
    articles = soup.find_all('h2', class_='entry-title')
    data = []
    for article in articles:
        title = article.find('a').get_text()  # Extract the title text
        link = article.find('a')['href']      # Extract the link
        data.append({'title': title, 'link': link})
    return data
```
This step could be used to further process the data, such as simplifying or formatting. In this example, we might not need complex processing.
```python
def clean_text(text):
    # Simple text cleaning: collapse newlines and trim whitespace
    return text.replace('\n', ' ').strip()

def process_data(data):
    # data is the list of {'title': ..., 'link': ...} dicts returned by parse_content
    return [{'title': clean_text(item['title']), 'link': item['link']} for item in data]
```
Finally, we need a function to save the data to a file.
```python
import json

def save_data(data, filename='data.json'):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
```
Now we can combine these steps and run the complete crawler.
```python
if __name__ == "__main__":
    url = 'https://datatager.com/blog'
    html_content = fetch_html(url)
    if html_content:
        parsed_data = parse_content(html_content)
        processed_data = process_data(parsed_data)
        save_data(processed_data, 'blog_posts.json')
        print("Data has been saved to 'blog_posts.json'.")
```
So when we scrape https://datatager.com/blog, essentially, we are storing the website's HTML and then parsing it to extract the content we want.
Previously, in step 2, we used XPath, CSS selectors, and regular expressions to extract desired elements from HTML.
In step 3, data processing involved simple replace functions, regular expressions, or basic models like BERT or word2vec.
Both of these steps essentially involve tasks related to language understanding, and now we can leverage the natural and powerful language understanding abilities of LLMs to accomplish these tasks.
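As a rough sketch of that idea, the parsing step can be handed to an LLM: send the (truncated) HTML together with a description of the fields we want and ask for JSON back. The snippet below assumes an OpenAI-compatible chat API and uses a placeholder model name; it is only an illustration, not the exact implementation in the linked ai_crawler.py.

```python
# A rough sketch of an LLM-based parse_content, assuming an OpenAI-compatible API.
# The model name is a placeholder; chunking of long HTML and error handling are omitted.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def llm_parse_content(html_content, target_description):
    prompt = (
        "From the HTML below, extract: "
        f"{target_description}. "
        'Respond with a JSON object of the form {"items": [...]} and nothing else.\n\n'
        f"HTML:\n{html_content[:20000]}"  # truncate to stay within the context window
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)["items"]

# Example: llm_parse_content(html, "blog post titles and their links")
```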
Of course, there are many details missing here, and a practical, operational project would require much more. However, the above three-step abstraction method can effectively solve complex logical engineering problems. This also forms the prototype of a general AI crawler.
Here is the basic code for this section, which can already run and achieve decent results. Those interested can expand upon this code prototype: https://github.com/PandaVT/DataTager/blob/main/blog_code/ai_crawler.py
PS: Since the first part already detailed the crawler process, the following sections will only briefly introduce them.
For detailed code, please visit: https://github.com/JessyTsu1/Primary_students
The repository is organized as follows:

- `code` folder: preliminary, simple analysis code.
- `data` folder: the final data and processing scripts.
- `Multi_stu` folder: crawling with Scrapy.

Based on the code analysis above, we might encounter the following issues:
First, attempt a simple scrape and time it to see if brute-force scraping is feasible:
```python
# Rough sketch of the brute-force approach (pseudocode)
for i in range(10000):
    url = "https://school.wjszx.com.cn/senior/introduce-{}.html".format(i)
    spider(url)

def spider(url):
    # fetch the page and parse out the fields we need
    # return name, phone, address, url (or a single info object)
    ...

def saving(info):
    # open an existing Excel file
    # open a sheet
    # in a loop, write each field to a fixed column of the current row
    # close the Excel file
    ...
```
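To judge whether brute-force scraping is feasible, one simple approach is to time a small sample of requests and extrapolate. A minimal sketch (sequential requests only; the sample size is chosen arbitrarily):

```python
# Time a small sample of sequential requests to estimate the total crawl time.
import time
import requests

sample = 20
start = time.time()
for i in range(sample):
    url = f"https://school.wjszx.com.cn/senior/introduce-{i}.html"
    try:
        requests.get(url, timeout=10)
    except requests.RequestException:
        pass  # ignore failures in this rough estimate
elapsed = time.time() - start
print(f"{sample} requests took {elapsed:.1f}s "
      f"-> roughly {elapsed / sample * 10000 / 60:.0f} minutes for 10,000 pages")
```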
Here's a simple usage introduction:
After installing with pip, in the same directory:
scrapy startproject stu
This will create a Scrapy project directory, with a folder called spiders where you write the crawlers.
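For reference, a minimal spider inside the spiders folder might look like the sketch below; the selectors and the page range are placeholders, not the actual structure of the target site.

```python
# spiders/stu.py -- a minimal Scrapy spider sketch; selectors are placeholders.
import scrapy

class StuSpider(scrapy.Spider):
    name = 'stu'
    start_urls = [
        f'https://school.wjszx.com.cn/senior/introduce-{i}.html' for i in range(1, 100)
    ]

    def parse(self, response):
        # each yielded dict is handed back to Scrapy's pipeline as one item
        yield {
            'name': response.css('h1::text').get(),       # placeholder selector
            'phone': response.css('.phone::text').get(),  # placeholder selector
            'url': response.url,
        }
```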
Here, yield is roughly equivalent to return, except that the spider can keep yielding items and follow-up requests instead of ending the method.
Several settings in settings.py:
```python
FEED_EXPORT_ENCODING = 'utf-8'   # export as utf-8
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100
# DOWNLOAD_DELAY = 3             # set a download delay on stricter sites, to counter anti-scraping defenses
COOKIES_ENABLED = False
```
Start crawling:
scrapy crawl stu
Start crawling and save the results as stu.json:
scrapy crawl stu -o stu.json
https://github.com/JessyTsu1/google_scholar_spider
https://github.com/JessyTsu1/DouYinSpider
Data collection on e-commerce platforms is a typical application scenario. By scraping product information and user reviews, market analysis and consumer behavior studies can be conducted.
Data scraping from news websites is primarily for obtaining the latest news content, training news classification models, or generating models.
Data analysis on social networks involves scraping and analyzing user relationships and interactions to provide insights into social behavior.
These practical examples demonstrate how different crawling tools are operated and their effects in various application scenarios, helping to better understand how to select and use crawling tools for effective data collection.
In scientific research, researchers often cannot directly obtain data through web scraping for various reasons, and the field typically has fixed benchmark datasets available for researchers to optimize towards specific targets. In this context, open datasets become a crucial resource for researchers to use and benchmark against.
Public datasets are one of the important sources of data for training large language models, originating from various research institutions, government agencies, and open-source communities. When selecting and evaluating public datasets, it's essential to consider the quality, size, and diversity of the datasets to ensure effective model training.
Public datasets can be obtained from several types of sources:
Research Institutions: Many universities and research institutes release a vast amount of high-quality research datasets, typically used for academic research and experiments. For instance, institutions like Stanford University and MIT often release datasets in the fields of machine learning and artificial intelligence.
Government Agencies: Government departments offer a wide array of public datasets covering areas such as the economy, society, and the environment, which can be utilized for various analyses and studies. For example, the US government's Data.gov portal provides a rich resource of public data.
Open Source Communities: Open source communities and projects contribute a large number of datasets, which are usually maintained and updated collaboratively by developers and researchers. Examples include GitHub, ModelScope, and HuggingFace.
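Many of these community-hosted datasets can be pulled with a couple of lines using the `datasets` library; the sketch below uses the well-known SQuAD dataset purely as an example.

```python
# Loading a public dataset from the HuggingFace Hub with the `datasets` library.
# "squad" is used here only as an example of a widely used public dataset.
from datasets import load_dataset

dataset = load_dataset("squad", split="train")
print(dataset)     # basic info: number of rows, column names
print(dataset[0])  # first example: question, context, answers
```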
When choosing public datasets, the key criteria are the ones noted above: quality, scale, and diversity. Commonly used platforms and resources for public datasets include Kaggle, the UCI Machine Learning Repository, HuggingFace, and ModelScope.
The application of public datasets in specific projects can significantly enhance the efficiency and effectiveness of data collection and model training. Here are some practical use cases:
Natural Language Processing Tasks: In natural language processing (NLP) tasks, text datasets from Kaggle can be used to train language models. For example, using the Quora Question Pairs dataset to train a question-answering system that improves answer accuracy by identifying and categorizing similar questions.
Computer Vision Tasks: In the field of computer vision, image datasets from the UCI Machine Learning Repository can be used to train image classification or object detection models. For instance, training an image classification model using the CIFAR-10 dataset to identify and categorize everyday objects.
These practical examples demonstrate the widespread application and importance of public datasets in various projects. Proper selection and use of public datasets can effectively enhance the training outcomes and application value of models.
During the pre-training phase, LLMs learn a broad range of knowledge from vast amounts of unlabeled text data, then store this knowledge in their model parameters. This endows LLMs with a certain level of language understanding and generation capability.
Text from web pages, academic materials, books, and texts from various fields such as legal documents, annual financial reports, medical textbooks, etc., are sources for pre-training corpora.
We typically categorize pre-training datasets into two types: general datasets and domain-specific datasets.
It's noteworthy that incorporating math and code data into pre-training significantly enhances the model's reasoning capabilities.
Books are a rich source of unannotated pre-training data. If we plan to use books as pre-training data, we inevitably need to study how books are classified, so that we can easily select relevant books to strengthen the model's capabilities in specific domains.

For reference, see Ruan Yifeng's article: https://www.ruanyifeng.com/blog/2007/01/classification.html
Books can be classified as follows:
1. Chinese Library Classification (Fourth Edition) (CLC)
- A Marxism, Leninism, Mao Zedong Thought, Deng Xiaoping Theory
- B Philosophy, Religion
- C Social Sciences (General)
- D Politics, Law
- F Economics
- G Culture, Science, Education, Sports
- H Language, Writing
- I Literature
- J Art
- K History, Geography
- N Natural Sciences (General)
- O Mathematical Sciences and Chemistry
- Q Biological Sciences
- R Medicine, Health
- S Agricultural Sciences
- T Industrial Technology
- U Transportation
- V Aviation, Aerospace
- X Environmental Science, Safety Science
- Z Comprehensive Books
2. Library of Congress Classification (LCC)
- A - General Works
- B - Philosophy. Psychology. Religion
- C - Auxiliary Sciences of History
- D - History (General) and History of Europe
- E - History: America
- F - History: America
- G - Geography. Anthropology. Recreation
- H - Social Sciences
- J - Political Science
- K - Law
- L - Education
- M - Music and Books on Music
- N - Fine Arts
- P - Language and Literature
- Q - Science
- R - Medicine
- S - Agriculture
- T - Technology
- U - Military Science
- V - Naval Science
- Z - Bibliography. Library Science. Information Resources (General)
3. Dewey Decimal Classification (DDC)
- 000 Generalities
- 100 Philosophy & psychology
- 200 Religion
- 300 Social sciences
- 400 Language
- 500 Natural sciences & mathematics
- 600 Technology (Applied sciences)
- 700 The arts
- 800 Literature & rhetoric
- 900 Geography & history
Instruction fine-tuning datasets are used to further fine-tune pre-trained large language models (LLMs), enabling the model to better understand and follow human instructions. This process helps bridge the gap between the LLMs' next token prediction goal and the objective of following human instructions, thereby enhancing the capabilities and controllability of LLMs.
SFT datasets can be categorized into two types according to the type of instruction: QA Consultation datasets and Agent Subtask datasets.
These datasets do not have specific instructions but require the model to respond in a consultative manner, defined as QA Consultation datasets, such as:
{ "instruction": "", "input": "I got hit by a car when I went out today, what should I do?", "output": """I'm very sorry to hear that, and I hope you weren't seriously injured. Here are some steps you might need to consider: Seek medical help: Even if you feel that the injuries are not severe, sometimes internal injuries may not be immediately apparent. It's best to go to a hospital or clinic for a comprehensive examination. Call the police: If you haven't already, you should contact the local police immediately so they can come to the scene and make a record. This is very important for any possible claims or legal procedures later. Document evidence: If possible, take photos to document the accident scene, vehicles, injuries, etc. If there are witnesses, try to get their contact information. Contact your insurance company: Notify your insurance company about the accident as soon as possible and follow their guidance for the next steps. Legal consultation: Consider consulting a lawyer to understand your rights and possible compensation. I hope you recover quickly, and if you need further help or information, let me know anytime!""" }
These datasets do not require the model to complete specific tasks, only to learn the logic and style of the output. It's worth mentioning that training the model on a large amount of domain-specific consultation data, such as medical consultations, will give its responses the character of a medical model.
Datasets that involve predefined task types through specific instructions, allowing the model to learn the expressions of respective tasks, are defined as agent sub-task datasets. Examples include:
{ "instruction": "You are an experienced multilingual translator, please translate the following text into ancient Chinese poetry, ensuring faithfulness, expressiveness, and elegance", "input": "Of course I still love you", "output":"纵使万里云舒,吾心犹存" }, //Tips: from Elon Musk, SpaceX { "instruction": "Please ask further questions regarding the following legal issue to obtain more relevant information:", "input": "My father married my stepmother and they had a son. When my father fell ill, he left all his property to his son. My stepmother says that since we both cared for the elderly together, how can she protect her rights?", "output": "1. Do you have any written or verbal agreements with your father and stepmother regarding the distribution of property and the care of the elderly?\n2. Do you have evidence that you contributed to the care of the elderly, such as paying for medical expenses or providing daily support?\n3. Do you have any other evidence that suggests an unfair distribution of property by your father, such as a will or property transfer records?\n4. Have you communicated or negotiated with your stepmother to resolve this issue?\n5. Have you consulted a lawyer or other legal professional to understand your rights and possible ways to protect them?" }, { "instruction": "Identify the legal key terms from the following information:", "input": "Liability determination for losses caused by unauthorized transactions on a stolen bank card", "output": "Bank card disputes; counterfeit card transactions; liability allocation; civil litigation; contract;" },
InstructGPT-sft (Ouyang et al., 2022) categorized part of the dataset into ten categories, including Generation, Open-ended QA, Brainstorming, Chat, Rewrite, Summarization, Classification, Other, Closed-ended QA, and Extraction.
BELLE train 3.5M CN (BELLE Group, 2023) built on this foundation by adding categories such as Role-playing, Math, Translation, Coding, and Harmless, while removing the Chat and Other categories.
Firefly (Yang, 2023) further refined the instruction categories, covering 23 categories. For example, subcategories like Story Generation and Lyrics Generation fall under the original "Generation" category. Considering the current classification status and focusing solely on single-turn dialogue instructions, instructions are broadly divided into the following categories: Reasoning, Math, Brainstorming, Closed-ended QA, Open-ended QA, Coding, Extraction, Generation, Rewrite, Summarization, Translation, Role-playing, Social Norms, and Others.
Observing the above task categories, we can draw an interesting conclusion: the key to training domain-specific large models is to refine the agent sub-task dataset as much as possible, then choose a general-purpose model with decent capabilities along with domain-specific fine-tuning (SFT) datasets to create a domain-specific large model that appears capable of solving all problems within that domain.
DataTager is a product designed to address the creation of SFT datasets.
In the data collection process, collaborating with partners is a highly effective way to obtain high-quality, customized data. Such collaborations can provide highly relevant samples for specific projects, ensuring data quality and reliability.
Collaborating with partners for data collection offers several key benefits:
When selecting partners, consider the following criteria to ensure smooth cooperation and achieve the expected outcomes:
Here are some successful examples of partner data collection, along with challenges encountered and solutions:
The term "confidence" is quoted because it is a name I coined myself based on my experiences since starting my business. In fields such as law, healthcare, education, and finance, even if we aim to create consumer-oriented products that are useful to the general public, users may still have concerns after consultations regarding the endorsement of responses by well-known organizations and institutions. Endorsements from respected institutions like Union Hospital, Red Circle, Luo Xiang, or Magic Square can significantly boost user trust. I refer to this as "confidence."
By closely collaborating with partners, high-quality, customized data can be acquired, significantly enhancing the training outcomes and practical applications of models. Furthermore, partnerships with well-known institutions can increase user trust in products ("confidence"), laying a solid foundation for the success of consumer products in the market. When selecting partners and addressing challenges in the collaboration process, it's important to consider factors such as reputation, data relevance, and technical capability to ensure smooth cooperation.
Crowdsourcing platforms are an effective method of collecting and processing data through the power of the masses. In this model, tasks are assigned to a large number of online workers who complete the tasks and provide feedback. Crowdsourcing data collection has significant value in the training of large language models.
Category | Crowdsourcing Platforms |
---|---|
Advantages | Data Diversity: data can be collected from workers worldwide, ensuring diversity and broad coverage. Rapid Collection: large data collection tasks can be completed quickly. |
Disadvantages | Data Quality Control: data quality may vary due to the diverse backgrounds and capabilities of workers. Management and Coordination: managing and coordinating a large number of workers' tasks requires extra effort. |
Here are some commonly used crowdsourcing platforms, which have extensive experience and broad applications in data collection and processing:
Figure Eight (formerly CrowdFlower): Figure Eight is a professional crowdsourcing platform that offers services such as data labeling, cleaning, and verification, widely used in machine learning and artificial intelligence projects.
Appen: Appen is a well-known crowdsourcing platform that specializes in providing high-quality data collection and labeling services, particularly excelling in language data and image data processing.
Crowdsourcing platforms are extensively used in the training of large language models, and here are some actual use cases:
Crowdsourcing platforms provide robust support for data collection and annotation in the training of large language models. By effectively utilizing these platforms, diverse, high-quality data can be efficiently acquired, providing a solid foundation for the training of large language models. Additionally, the rapid response and large-scale data processing capabilities of crowdsourcing platforms meet the high data demands of large language model training.
During the data collection and processing stages, selecting the appropriate data storage format is crucial to ensure data availability and processing efficiency. Common data storage formats include JSON, JSONL, CSV, and XML. Other formats like Parquet, Avro, and HDF5 are also used, each with its own advantages and disadvantages.
Format | Readability | Parsing Efficiency | Storage Space | Other Features |
---|---|---|---|---|
JSON | Easy to read and write, clear structure | Moderate parsing speed | Larger than CSV, but smaller than XML | Easy to use, widely supported |
JSONL | Easy to process line by line | Fast parsing speed | Moderate | Suitable for large-scale data processing |
CSV | Simple and easy to read, but lacks structural info | Fast parsing speed | Smallest | Easy to handle, widely used |
XML | Good readability, but verbose | Slow parsing speed | Largest, includes a lot of tag information | Strong self-descriptiveness, supports complex structures |
Parquet | Poor readability | Fast parsing speed, suitable for columnar reading | High compression efficiency, occupies less space | Columnar storage, suitable for big data analysis |
Avro | Average readability | Fast parsing speed | High compression efficiency, occupies less space | Supports data schema evolution, suitable for stream processing |
HDF5 | Poor readability | Fast parsing speed | Efficient storage of large-scale data | Supports complex data structures and multidimensional arrays |
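As a quick, hedged illustration of how the first three formats differ in practice, the sketch below writes the same records as JSON, JSONL, and CSV.

```python
# Writing the same records as JSON, JSONL, and CSV to compare the formats.
import csv
import json

records = [
    {"title": "Post A", "link": "https://datatager.com/blog/a"},
    {"title": "Post B", "link": "https://datatager.com/blog/b"},
]

# JSON: one document holding the whole list; easy to read, parsed in one go
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# JSONL: one JSON object per line; convenient for streaming and appending
with open("data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# CSV: smallest and simplest, but flat -- nested structures do not fit naturally
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(records)
```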
When choosing a data storage format, weigh the options against the specific use case and requirements: for example, JSONL suits large-scale line-by-line processing, CSV suits simple tabular data, and Parquet suits columnar analytics over big data.
By wisely choosing data storage formats, data processing efficiency can be significantly enhanced, ensuring data readability and maintainability. Selecting the most appropriate data format based on specific application scenarios and requirements helps improve overall data management and utilization.
During data collection, it's crucial to pay close attention to data privacy and compliance, while also ensuring data quality and consistency. Here are some key considerations.
During data collection and processing, it is mandatory to comply with relevant data privacy regulations to ensure the legality and protection of user privacy.
Data cleaning and preprocessing are crucial steps to ensure data quality, removing noise and errors, and enhancing data accuracy and consistency.
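A minimal cleaning sketch, assuming the records are the title/link dictionaries scraped earlier; the specific rules (strip tags, collapse whitespace, drop broken rows) are illustrative only.

```python
# A small, illustrative cleaning pass: strip stray HTML tags, normalize whitespace,
# and drop records with empty or clearly broken fields.
import re

def clean_record(record):
    title = re.sub(r"<[^>]+>", "", record.get("title", ""))  # remove stray HTML tags
    title = re.sub(r"\s+", " ", title).strip()                # collapse whitespace
    return {"title": title, "link": record.get("link", "").strip()}

def clean_dataset(records):
    cleaned = [clean_record(r) for r in records]
    # keep only records whose fields are non-empty and plausible after cleaning
    return [r for r in cleaned if r["title"] and r["link"].startswith("http")]
```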
Deduplication and standardization are important steps in data collection to avoid redundancy and inconsistency.
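Exact-match deduplication can be as simple as hashing a normalized form of each record and keeping the first occurrence; the sketch below shows that idea (it does not cover fuzzy or semantic deduplication).

```python
# Exact-match deduplication on a normalized key, plus simple standardization.
import hashlib

def normalize(text):
    # standardize case and whitespace so trivially different copies collide
    return " ".join(text.lower().split())

def deduplicate(records, key="title"):
    seen, unique = set(), []
    for record in records:
        digest = hashlib.md5(normalize(record[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```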
By adhering to data privacy regulations, performing effective data cleaning and preprocessing, and conducting deduplication and standardization, data quality and consistency can be significantly enhanced, laying a solid foundation for subsequent data analysis and model training.
https://x.com/vikingmute/status/1737296571354743268
https://timconnors.co/posts/ai-scraper
https://zhuanlan.zhihu.com/p/692683234
https://zhuanlan.zhihu.com/p/669816705
https://zhuanlan.zhihu.com/p/693096151
For starting a proxy pool using Redis: https://github.com/Python3WebSpider/ProxyPool
https://blog.csdn.net/gyt15663668337/article/details/86345690
A simple introduction to Scrapy: https://vip.fxxkpython.com/?p=5038