How do we meaningfully measure human society in the 21st century? | 5Y View

五源资本·October 28, 2021·26·0

Mining meaning from data.

In the digital age, our social interactions, work, and daily lives generate massive amounts of data every moment. How do we extract meaning from this data to guide our understanding of ourselves and the world?

Today's article comes from Nature. The authors argue that we need to build new ways of measuring things — making the immeasurable measurable. Hope you find it thought-provoking :)

Reprinted from: Swarma Club

Authors: Lazer, D., Hargittai, E., Freelon, D. et al.

Translator: Hanjing Wang

Paper title: Meaningful measures of human society in the twenty-first century

Abstract

Science rarely outpaces what scientists can observe and measure, but sometimes observation far outstrips scientific understanding. The 21st century offers such a moment for the study of human society. Human behavior is observed today on a scale unimaginable at the end of the 20th century. Our social interactions, movements, and many everyday activities are potentially available for scientific study — sometimes through instruments purpose-built for research goals (e.g., satellite imagery), but more often these goals are afterthoughts (e.g., Twitter data streams).

In this article, we assess the potential of this massive instrumentation — technologies that create structured representations and quantifications of human behavior — through the lens of scientific measurement and its principles. We are particularly concerned with how to extract scientific meaning from data that were often not created for these purposes. These data pose conceptual, computational, and ethical challenges that require a renaissance in our scientific theories to keep pace with rapidly changing social realities and our capacity to capture them. In other words, we need new approaches to managing, using, and analyzing data.

The application of sensor technologies has multiplied across many domains of human activity, from automobile tracking devices to online browsing. Satellites regularly scan and digitize the Earth. Advances in techniques developed by computer scientists for processing unstructured data — such as text, images, audio, and video — have made it possible to convert artifacts like books, broadcasts, and television programs into data. In the 21st century, human behaviors — from mobility to information consumption to various forms of social interaction — are increasingly recorded and potentially computable. Past communication technologies, from mail to print to fax, typically left behind few durable, accessible artifacts; in the past decade or so, as related physical objects have been digitized, they too have become computationally accessible. The digitization of books is one example, enabling computational analysis of large corpora of human expression stretching back centuries.

These new data streams are often compared to the development of the telescope. As Robert Merton (Nobel laureate in Economics, 1997) famously put it: "Perhaps sociology is not yet ready for Einstein because it has not yet found its Kepler..." Merton's provocative remark referred to sociology's lack of an empirical foundation for great theory. To this, Duncan Watts (computational social scientist) responded 62 years later: "...the technological revolution in mobility, networks, and internet communication has the potential to revolutionize our understanding of ourselves and how we interact with the world by rendering previously unmeasurable variables measurable. Merton was right: the social sciences still haven't found their Kepler. Alexander Pope, the 18th-century English poet, proposed three centuries ago that the proper study of mankind is not in the heavens but of man, and now we have finally found our telescope."

We believe that digital data have the potential to transform the social sciences. However, the metaphor of instrumented society's data streams as a "telescope" is misleading in important ways. First, the study of society differs from the study of stars because patterns characterizing human behavior typically vary across time and place. Second, the measurements built from these data streams are suspect because these data sources were not created with scientific goals in mind and therefore must be actively scrutinized. We address the first point now; the rest of this article discusses the second.

1. The Unstable Logic of Society and Measurement

Empirical social science has focused primarily on discovering generalizable rather than universal patterns in human behavior. The portions of social science aimed at discovering such universal patterns from human behavior (e.g., evolutionary psychology) are trivial relative to the field as a whole. The problem of instability in the rules governing human society is exacerbated by the sociotechnical systems that collect data on people, which are actively (and in some cases intentionally) changing the social world that social science will study. Through what social scientists call reflexivity and self-fulfilling prophecies, humans actively change the world they observe by acting on the knowledge they gain (in part through instruments).

Reflexivity refers to the cycle that links social reality with the theories and measures we design to explain it. For example, in the analysis of electoral politics, "bandwagon" and "underdog" effects have long been identified to explain how polls and predictions influence voting behavior. If a candidate is predicted to be a likely winner, more people may decide to vote for them (bandwagon effect); or conversely, more people may mobilize to increase support for the candidate predicted to lose (underdog effect). These effects reflect the impact of measurement on attitudes and behavior, and how our measures distort the very phenomena they were designed to monitor. In turn, algorithmic decision-making in public health, law enforcement, sentencing, education, and hiring amplifies these distortions.

Reflexivity also manifests as observer effects, where people change their behavior when they know they are being watched. Digital technologies create new versions of the reflexivity problem, amplifying the performative aspects inherent in social indicators. When Google launched its Flu Trends Project in 2008, the goal was to use search queries to estimate the prevalence of flu symptoms in the population. However, in 2013 the project substantially overestimated peak flu levels. One reason was the flawed assumption that search behavior was driven by external events (such as flu symptoms). In fact, Google's algorithms were also driving these patterns: by attempting to predict user intent through suggested search terms, Google distorted the information users would otherwise have seen. In other words, the response to the observed phenomenon changed the phenomenon itself.

Obfuscation strategies represent another version of observer effects: we can now interfere with data collection by deliberately adding ambiguous or misleading information, thereby disrupting measurement. Examples of obfuscation include editing profile photos to prevent facial recognition; using virtual private networks (VPNs) to hide one's location while browsing; or using group identities (e.g., multiple people sharing one user account) to mask details of individual user behavior. Here the reflexive loop arises because awareness that behavioral traces are fed back into measurement and surveillance leads people to intentionally alter the meaning of that behavior. On a larger scale, this is analogous to respondents lying to survey researchers. And because the skills required to understand the surveillance occurring and how to implement obfuscation to address it are not randomly distributed in the population, the individuals who alter data in this way will not be random either.

The often unobtrusive nature of many digital measurements suggests that, on balance, observer effects may be less problematic with these new data sources than in the past — for example, the gender, age, and race of the person conducting an interview can dramatically change the answers respondents provide. However, the cycle linking social reality with the measures we design to analyze it has been strengthened — reflexivity is now embedded in the very tools used to monitor and predict human behavior. It is as if the Hubble Telescope were organizing the positions and behaviors of stars while observing them. For example, social media not only capture human behavior but also have the potential to alter important patterns of human society: the speed of information flow, the scope of media production, and the actors responsible for defining public opinion.

The principles organizing human society are fluid, and therefore the meaning of a particular measure also evolves. Part of why social science must adapt to new types of data is that emerging sociotechnical systems are diminishing the importance of some older scientific tools for measuring human behavior. Existing measures of key concepts such as gross domestic product (GDP) and geographic mobility remain constrained by the strengths and weaknesses of 20th-century data. If we only evaluate against old measures, we merely replicate their flaws, mistaking the gold standard of the 20th century for objective truth. Consider, for example, a standard question from the American National Election Studies [17] about radio consumption of election content (originally from 1978): "Would you say you listened a great deal, a fair amount, very little, or not at all to speeches or discussions about the 'elections' on the radio?"

This structure of "media consumption" as a limited set of discrete units is a product of broadcast-era technology. The question has little to do with how people access digital media today. Attempting to capture social media behavior by asking questions like "How many tweets did you see today?" or "Which Twitter accounts appeared in your feed?" is futile. Many methods for measuring behavior developed in the early days of quantitative social science were: (1) necessary given the measurement constraints of the time; and (2) grounded in a demonstrably different social reality.

Figure 1. Measurement in social science. Measurement serves as the bridge connecting scientific motivation, data and insight, and application.

Figure 1 summarizes how measurement fits into the general scientific process. Below we discuss the core challenges of transforming data from these sociotechnical systems into scientific measurement. This discussion will include two illustrative examples of data streams that underpin much social science research: location data from mobile phones and social media posts on Twitter. The key question we now address is what objects currently need to be measured with this massively instrumented human behavior, noting the key principles of measurement summarized in Box 1.

Box 1 | Core Principles of Measurement

Measurement should follow the definition of important questions

Measuring observed phenomena is predicated on identifying relevant questions. Important questions are driven by research questions, which may be motivated by normative goals, theoretical debates, or empirical puzzles.

Measurement must be actively constructed from data

Instruments designed for research purposes often produce scientific data. But data collected for purposes other than scientific research are also frequently repurposed by scholars. Data do not inherently mean anything that would make them measures of some theoretical construct; they must be transformed through various methods so that they systematically relate to each other and to scientific theory.

Scientific measurement follows the above principles in an evolving cycle

Scientific motivations guide researchers in designing data collection protocols, using third-party data, or developing some combination of the two. In their raw format, data provide observations that are processed into measures, which can then be used to test preconceived hypotheses (in a deductive manner) or to derive new hypotheses from exploratory analysis (in an inductive, data-driven manner). These deductive and inductive analyses aim to provide insights that feed back into scientific motivations, inform policy interventions, or more broadly, advance basic and applied research.

2. Tracking What Data Measure

The purpose of measurement using behavioral trace data is to extract meaning from the raw data produced by instruments. All scientific data instruments face this problem, but when we use data recovered from systems designed for other purposes, the gap from raw data to meaningful measure is often particularly wide. For example, unprocessed mobile phone movement data reporting specific latitudes and longitudes is largely uninteresting, while processing such data enables us to measure proximity, mobility, and other socially relevant concepts.

The key challenge is whether our measures accurately capture the constructs we want to examine. Do they align closely with other measures of the same object? What are the potential deviations between construct and concept (for example, if measuring physical phone activity, how important are the static activities that get missed, like treadmill use)? When we examine constructs that are assumed to be unrelated, do our measures reflect the expected absence of association? Overall, observational data in the 21st century was not designed for research, and before we can use it to answer scientific research questions, we need to connect these observations to known concepts.

The meaning of measures comes partly from theory. Theory-driven design that applies existing knowledge to interpret digital signals can overcome many problems of using instrumented behavioral data. Conversely, ad hoc operationalization lacking theoretical grounding makes research findings difficult to interpret and inconsistent across different studies. As noted earlier, formal theory not only helps generate hypotheses but also helps select an appropriate way to measure constructs with big data.

For example, consider using mobility data to study the spread of COVID-19. Multiple studies used real-time travel data to track population movement from Wuhan to other provinces in China. Researchers found that population outflow from Wuhan was strongly predictive of whether the coronavirus entered a region, allowing local disease control personnel to predict subsequent viral spread. In these studies, there was a well-theorized process with the assumption that viral spread was driven by individual proximity. The chosen theoretical framework in turn indicates how generalizable these findings might be to other cases. That is, we might expect similar patterns in the United States but not in Australia, because Australia implemented strict testing and quarantine procedures for travelers. Any given empirical finding is necessarily local in time and space; theory is needed to appropriately transport any measure to new geographic or temporal contexts.

As we conduct more research using large, complex data sources and formats, methods that provide insight into the validity of new measures become particularly valuable. One promising approach combines study of classic, validated self-report scales with new methods measuring related concepts. For example, self-reported news attention and exposure can be combined with eye-tracking to capture visual attention to online content. Such triangulation-like measurement methods also help confirm the validity and robustness of new behavioral constructs. Researchers have used phone data to design a proximity-based measure to record when people are near each other. These indicators can serve various useful purposes: they can function as indicators of relationship strength or as a method for tracking viral transmission pathways. But there is potential for error: for example, two people whose Bluetooth beacons show their devices in proximity might be separated by a wall, or might simply be charging their phones from the same outlet. In such cases, triangulation can be achieved by incorporating self-reported data, such as sending a message to someone's phone asking who else was nearby.

For internet-based research, relatively little is known about the basic population characteristics and underlying mechanisms of user behavior on digital platforms. Many basic concepts remain difficult to measure, even on online platforms that provide researchers with convenient data access. Despite thousands of papers based on Twitter data in recent years, social media scholars still find that identifying the demographic characteristics of individual users remains a massive challenge. Moreover, researchers still cannot reliably distinguish humans from non-humans (e.g., bots, collective accounts, or organizations), though important progress has been made in this area. Consequently, most Twitter research makes inferences about accounts or tweets; few Twitter studies can reasonably claim to be making statements about human behavior. For research questions focused on human behavior on Twitter, methods that link user accounts to administrative data or survey results show promise for identifying humans and their demographic attributes on Twitter.

Even when humans are the source of particular behaviors, attributing specific behaviors to specific individuals can present challenges. For example, in the early days of broadcast television, audience research encountered the challenge of multi-member households. Data from these cases would suggest that someone simultaneously liked children's cartoons and cable news, when in fact two different individuals were involved. Thus, technical devices can be misleading when behaviors are shared between people (two individuals using the same Netflix account) or across devices (the same person viewing Twitter on smartphone and desktop). Compounding this problem, device-person mismatches may evolve rapidly over time. For example, research based on desktop browsing data suggested that news consumption had systematically changed, when this may simply have been a product of the gradual shift of news consumption from desktop browsers to mobile apps. The lack of stability in how people use these different systems (and devices) may make such comparisons over time essentially impossible.

Using models based on other data can facilitate measurement of specific behaviors. For example, a particular device used by someone can be modeled from other data, and the output of this model is less sensitive than discrete assumptions about the identity of the device user. A cable news viewer might be a grandparent, while an Xbox user might be a grandchild. However, the data included in these models always comes from the past, and the relationships between measures are themselves unstable. This is the fundamental problem of induction, which cannot be overcome without a metaphysical revolution; we suggest that continually updated measures and models should represent our best improvement on this problem. That is, we should plan for the biases in what we measure and conduct ongoing assessment of how measures specifically reflect current social reality. For example, measures of inflation need to assess how the basket of goods people consume changes over time. This is a useful recalibration, though it also illustrates the limitations of this approach, because the emergence of entirely new products (for example, no one was buying smartphones in 2000) makes consumption inherently incomparable across time.

Driven by the internet, developments in communication technology have also led to the fragmentation of behavior into separate data silos. Consider a research question: Is "synchronous voice-mediated communication at a distance" important for reducing feelings of social isolation? Over the past half-century, this behavior has continually fragmented into different systems, from government-sanctioned monopolies (e.g., AT&T in the United States) to oligopolies, to countless internet providers. Moreover, it is quite plausible that the data captured by these systems contain systematic biases—anyone you talk to by phone may be systematically different from someone you talk to via Zoom, Skype, or WhatsApp. Even the tortuous linguistic structure used above reflects sociotechnical complexity: not long ago, "synchronous voice-mediated communication at a distance" would simply have been described as "a phone call." An important consequence of this technological fragmentation is that measurements relying on a single digital device or service should be interpreted with considerable caution. The answers we find may differ from those we would obtain by measuring behavior in similar but different technologies. Ironically, because of this complexity, simple survey questions may better capture whom a person typically talks to than records from a single platform.

Conversely, seemingly similar behaviors observed in different data silos may actually capture quite different phenomena. Just as various name generators used in surveys to produce contact lists can identify different social ties, friends on Facebook, followers on Twitter, or contacts on LinkedIn do not represent the same relationships. Moreover, although there are likely some strong statistical connections between these concepts, none of these relationships represents "friends" as used in everyday or scientific language. Furthermore, these systems change over time, as does what they allow users to do. This in turn indicates that the causal processes behind our online social behaviors, relationships, and structures are constantly changing. We must therefore now be aware of the systematic changing nature of measures, such as temporal validity and cross-system validity. The challenge then becomes developing measures that provide some degree of generalizability across time or systems for a given research question.

Another deep problem is algorithmic confounding of measures. Here, confounding refers to our inability to distinguish signals representing typical human behavior from signals produced by the rules of digital platforms. Without knowing how systems are designed, we easily attribute social motivations to behaviors driven by algorithmic decisions. If Twitter's feed suddenly begins prioritizing sports content, users may be found to have learned who won the Olympic games without any underlying change in their interest in sports. Such changes are often difficult to detect, both because they are sometimes introduced without notice and because they may roll out unevenly, affecting certain user groups first. This mechanism also operates in more subtle ways, such as how algorithmic prompts amplify natural human tendencies. For example, if Twitter systematically suggests you follow back those who already follow you, this can promote our natural tendency to reciprocate social relationships [36]. More generally, internet companies aim to manipulate human behavior to increase engagement on their platforms (such as Facebook, Twitter, and Instagram) or spending on their products (such as Amazon and eBay). These machine learning-based manipulations are ubiquitous, and any effort to develop measures from platform data needs to assess the degree to which algorithms distort measures and downstream analyses. Given their importance, these algorithms merit further study.

Although an in-depth discussion of causal inference is beyond the scope of this paper, we should note that some of the measurement issues identified here pose a particular problem for research aimed at establishing causality. For example, lack of stability in measures over time may lead researchers to attribute changes in focal outcomes to unrelated external events. The earlier discussion of Google Flu Trends is also relevant here. In that case, there was an implicit assumption of a causal relationship between flu and flu-related Google searches. However, if around 2013 Google prompted flu-related searches during flu season because it inferred from its complex algorithmic mechanisms that it was flu season, then measuring the same behavior in 2013 and 2008 would mean quite different things.

The plasticity of human expression and language also poses a general challenge to inferring attitudes and opinions from language and image data. It is well known that emotional expression on Twitter is difficult for computers to decode because it is often confounded by sarcasm, irony, and hyperbole. The severity of the problem depends on the structure of the noise, and equally on what matters—that is, the research question.

3. What Trace Data Measures

Human behavior is a multilevel concept that often requires individual-level measurement to infer the distribution of behaviors, attitudes, and attributes at the collective level. Research questions should clarify what the target population is for a particular study. These populations can include various types of people, or can be specific to a geographic area (city or country), a particular community (interest group or company), or countless other subgroups (youth, immigrants, or politicians). In particular, when dealing with entire population groups, it is neither logistically nor financially feasible to collect data on everyone. In such cases, researchers are better off collecting data on random samples, meaning that every member of the population has an equal probability of appearing in the sample. This ideal has never been fully realized, and in the real world—with survey response rates below 10% and uneven accessibility of people through different recruitment methods—it is even less relevant.

For system-level data, one might assume that everyone is represented in the data because all users' actions are in the dataset. However, in this case, the sampling targets users of the system that collects the data and its most active members. This is at best a "convenience census" of the platform under study, not a census of the entire population. If the scientific goal is to make claims about people on the platform, this census may be compelling. Yet any generalization beyond this platform must be viewed more critically. This is a particular problem for Twitter research, the most frequently cited emerging data source, even though only about 20% of Americans use it and it is even less popular in most other countries. Importantly, social media platform users do not reflect the demographic characteristics of internet users overall, nor other attributes such as interests. Given recent advances in promoting the representativeness of research populations in other domains, these issues must be carefully considered in the social media sphere. We also note that methods for recalibrating data to make reasonable population-level inferences may be especially powerful when applied to large-scale data.

The generalization problem is amplified when studying only a subset of platform users. The key question is whether and how the nature of the sample affects inference. For this reason, for example, a study of Twitter users who include their name and location in their profiles raises the question: Do these findings apply to Twitter users who do not disclose these details? Similarly, another study investigated patterns of political information consumption among the small number of Facebook users who provide partisan labels in their profiles, but do the findings apply to individuals who do not reveal their political leanings? By social science standards, these studies have relatively large sample sizes, but this does not alleviate concerns that the sample is not representative of the people using the platform. Over time, the people using a platform can change dramatically (Facebook was once the exclusive domain of Harvard undergraduates), which exacerbates this problem—in such cases, these demographic shifts themselves also affect what happens on the platform.

Other critical questions about generalizability include how different platforms elicit systematically different behaviors. For example, the same person often behaves differently on Facebook and Twitter. More generally, some human behaviors are highly context-dependent; if we could only observe the same person in work, family, or religious settings, we might draw completely different conclusions about human nature. Generalizability is a function not only of population but also of the specific observational environment. Depending on the research question, this may or may not be a problem. A well-defined question and population will help determine how well the measurement aligns with research intentions.

Finally, we note the critical measurement issue of what the systematic biases in sampling are. Generally, our data collection systems are biased away from minority groups, especially marginalized populations; moreover, our theoretical questions about populations typically focus on the middle of the distribution. Representativeness is an extremely important problem for understanding humans, both now and in the past. One study analyzed text from Google Books—the largest digitized collection of human knowledge—hoping to draw conclusions about how linguistic change in these texts over centuries aligned with shifts in national sentiment. This corpus is compromised as a representation of language use because its composition changed systematically over time (for example, scientific texts were much more represented in the twentieth century), and because even a carefully curated set of books reflects the reality of an unrepresentative elite. Even the largest library in history cannot discover those who, though unrepresented in published literature, were nonetheless capable of taking action and changing the course of history.

These representativeness concerns were a major focus of twentieth-century social science methodology. Contacting respondents by mail systematically excluded the homeless, telephone surveys excluded those without phones, and in-person surveys depended on people's comfort and trust in interacting with strangers in this way.

Streams of observed behavior may be subject to similar biases. First, the instruments that collect data are typically consumer goods owned by individuals (such as mobile phones or computers), so cost is a barrier. Second, these tools are usually driven by business models targeting affluent people. Third, when people opt out of privacy services, those who are more concerned about or knowledgeable about privacy issues may be underrepresented in behavioral tracking systems.

However, these data streams have some key compensating properties. Sensor technology may fill important data gaps, making visible those who would otherwise be erased from the map. For example, in the absence of household income and consumption surveys, satellite imagery has been used to build wealth and poverty indicators for the Global South. The pervasiveness of modern technology means that in many cases, representation is superior to traditional data collection mechanisms—owning a mobile phone is cheaper than owning a home. This has parallels to the administrative data W. E. B. Du Bois used to study African American individuals in the late nineteenth and early twentieth centuries. Data from an administrative state practicing racial hierarchy was certainly not neutral, yet still held critical value in providing visibility to those in the most precarious positions in society.

Moreover, large sample sizes allow us to observe behaviors in data subsets, such as minority groups (in the general sense) and statistically uncommon but consequential events (such as hate speech or misinformation), where sample size and our ability to amplify smaller populations and rare data points matter more than sample representativeness. As Pareto observed long ago, much of human behavior is concentrated in a tiny fraction of the population; yet twentieth-century methods were generally ill-suited to studying this social reality. Perhaps twenty-first-century social theory will be able to leverage micro-level behavioral data to understand how structures of interdependence produce certain macro-level patterns.

4. Measurement Access and Ethics

Compared to data from the Hubble telescope, emerging data streams from sociotechnical systems pose two additional challenges. First, the Hubble telescope is controlled by scientific institutions whose goals presumably involve answering scientific questions. The institutional goals of platforms like Twitter are obviously not to answer scientific questions. Thus, the first question is: What can be measured? Second, humans as research participants raise ethical questions that distant galaxies clearly do not. So the follow-up question is: What should be measured? We address these two questions in turn.

What can be measured varies greatly depending on the system generating the data. It is possible to design small-scale data collection systems that rely on consenting participants; however, accessing data on millions of people typically requires partnering with platforms. Internet-based communication data has wide availability, with access rules varying substantially across data holders and over time. At the least restrictive end, platforms like Reddit and Wikipedia allow access to nearly everything that end users can view in machine-readable formats. In contrast, companies like Facebook and Twitter offer far more restricted access mechanisms, limited by time, data volume, and not all publicly visible data being programmatically accessible. Notably, no major platform currently provides individual-level data on what people pay attention to, which is a very large gap in current internet-based measurement. Moreover, no platform provides broad access to information from randomized controlled trials (in the form of A/B tests), which could in principle allow inference about the effects of their algorithms on individuals. Generally, any private institution controlling data that researchers are interested in can, absent contrary provisions, set the terms of data access as it chooses. That platforms like Twitter and Facebook are themselves the focus of scientific questions of public concern (consider: Does a platform amplify the spread of misinformation? What are platforms' responses to hate speech?) makes this control a serious problem. A responsibility of academia in these areas is to provide the public with information about these important questions. One corollary of what can be measured must be: If the power being questioned controls access to the data used to construct "truth," is it possible to speak truth to power? If not, is it possible to trust any measurement that can be extracted from a given system?

Emerging data sources also bring new ethical challenges. We focus on those related to measurement, particularly what can and should be measured. Broader discussions of trace data ethics and alternative models of data access can be found elsewhere; here, we briefly introduce five especially pressing issues.

First, although informed consent is the foundation of human subjects research, anonymized data obtained by third parties is often not considered "human subjects data" and thus escapes institutional review board scrutiny. What are researchers' ethical obligations when considering data collection scenarios while collecting data? More generally, people may not know how different systems track them, whether through mobile phone movement data or browsing data. So what are the ethical guidelines for using third-party tracking data when the targets of such tracking are at best nominally aware of this fact?

Second, the level of detail in behavioral datasets means that reliable anonymization is often practically difficult or impossible for re-identification efforts. It is important to note that de-identified anonymous data can be either of a type that cannot be re-identified, or of a type that can be. Some methods have emerged around "differential privacy" that allow adding noise to datasets, thereby guaranteeing anonymity to some degree and making reliable re-identification possible. However, there is a trade-off here, because the privacy-enhancing noise addition reduces data utility. This is the approach adopted in the Social Science One project, which provided analytical access to Facebook data (Box 2). One of the dilemmas facing teams granted access was whether the resulting data retained enough value to answer their questions (note: some authors participated in Social Science One and the Facebook 2020 Election Research Project).

Third, for publicly visible behaviors such as those on Twitter, what privacy expectations are reasonable? What obligations must researchers fulfill to obscure these behaviors? For example, when should researchers avoid mentioning information such as usernames and full social media messages (in publications or presentations) because they might attract negative attention or harassment? Some argue that automatically anonymizing public data may not be the right approach either; instead, content creators' preferences should be consulted.

Fourth, reliance on the principle of individual autonomy is inherently limited for two reasons. In a world of networked information and insight, information that one person reveals to others frequently spills over. The function of online media, by definition, is to facilitate interpersonal visibility. For example, individuals sharing email data must provide information from other individuals. The Cambridge Analytica scandal illustrated the dangers of such networked information disclosure, in which individuals used a Facebook app that in turn provided access to behavioral data from those users' friends. However, the risk of information spillover is a more general principle that is not new to digital trace data: individual disclosure almost always carries potential for spillover. For example, one person's genetic data may provide clues about close relatives; and almost all data about an individual provides information about others. One person's response about political preferences can reveal information about other family members' preferences; information about one person's drug use can reveal potential drug use among that person's friends.

There is also intrapersonal information inference, in which provided information (possibly given with consent) enables inferences that the individual did not anticipate. The practical ethical outcome cannot be to prohibit all research where informational spillover or inference is possible; however, this does mean that data sharing and data visibility typically need to be subject to significant limitations. The importance of data security also needs to be emphasized.

Building on our earlier discussion of "objects of measurement," extreme caution must be exercised when attempting to generalize findings from trace-based research to populations beyond the study platform and to participants' actual lives. It is important to find ways to engage digitally underrepresented participants, especially when such research is used to inform broad social or corporate policy decisions.

Conversely, compared to traditional twentieth-century methods, when digital forms of measurement can better represent marginalized groups, our ethical obligation should be to use them, as emphasized by the satellite data example mentioned above. The choice facing society is not whether digital technology will be used to measure human behavior, but when, how, and whether anyone beyond corporate or state surveillance can access this data. Ideally, large-scale digital data sources would flow into measurements that inform detailed policies and targeted interventions rather than one-size-fits-all initiatives, which often work poorly for minority groups.

Finally, the field has a responsibility to critique decisions made based on problematic measurement steps in practice. A previously published study showing racial bias in an algorithm used by many hospitals—bias caused by measurement error—is an excellent example of both the dangers of flawed measurement in automated decision-making and the potential for good science to help correct these problems.

Box 2 Data Access and Ethical Issues

Possible Implications of Platform Control Over Data Access

Research tools may become outdated without warning due to platform changes in data access.
Private data holders may require external researchers to collaborate closely with them as a condition of data access. Moreover, such collaborative work may be subject to review by the private data holder before publication. Research conducted under such direct platform control cannot be a source of critical insight about the platform in question.
If researchers' work falls outside the data holder's interests, or if they are unwilling to collaborate directly with the data holder, they may resort to methods that violate platform terms of service.
Early paradigms for facilitating access to platform data while maintaining researcher independence include:

Social Science One: This work involved externally approved research grants for analysis of aggregate Facebook data to which differential privacy was applied.

The Facebook 2020 Election Research Project: This project involved collaboration between external researchers and Facebook, in which Facebook researchers conducted data analysis using pre-registered analysis plans and metrics defined by external experts, who also oversaw execution of the analyses and retained full interpretive authority over results.

Ethical Questions

What are researchers' ethical obligations when considering data collection scenarios (e.g., through leaks or hacking)?
How should the research community address trade-offs introduced by data anonymization techniques (e.g., through noise addition)?
What privacy expectations are reasonable for publicly visible behaviors (such as social media posts)?
How do we manage information spillover, where data collected from consenting individuals reveals information about others who are unaware or did not consent?
How do we ensure that marginalized groups are adequately and accurately represented in research?

5. Outlook

Box 3 summarizes the fundamental arguments of this paper. The vast instrumentation of global society holds enormous potential to transform our understanding of the social world. Yet the revolution in human behavioral instrumentation demands a revolution in human behavioral measurement. Any new measurement regime needs to match the possibilities of both old and new social theories, grapple with the inherently unstable nature of measuring humans in these highly instrumented sociotechnical systems, and develop a new model of ethical research with human participants that balances individual rights and collective benefits.

Box 3 Key Questions for Measurement

What matters?

We can measure many concepts. We must be explicit about the ideas, values, priorities, and principles that guide our choice of research questions, and how we construct a topic worthy of study.

What is the temporal, spatial, structural, and cultural integrity of measurement?

Superficially similar structures can be measured in very different ways, and the same metric can vary as the system measuring it changes, or be inconsistent across different geographic, demographic, and cultural groups.

Who is being measured?

People who choose to use various systems such as social media platforms are not random, nor is their choice of which parts of these systems to use or how actively to use them. Consequently, trace data from users of such systems—which forms the basis of much research—may not generalize to broader populations, or even to other seemingly similar platforms.

Who has permission to measure?

Science is constrained by access to data. This data is limited by the functions, purposes, and protocols of the institutions and technologies that create it. These constraints will never be fully resolved and should therefore be a central concern of the field.

What is ethical to measure?

Digital data touches more people than ever before. Information can be collected not only about people in a study, but also about people around them who are not themselves in the study. The information spillover problem is inherent to all research in a networked world; it undermines the fundamental principle of individual autonomy in current research ethics frameworks and greatly increases researchers' responsibility to maintain data security. Standards and practices regarding consent, privacy, and confidentiality must account for these realities.

References

Pechenick, E. A., Danforth, C. M. & Dodds, P. S. Characterizing the Google Books Corpus: strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE 10, e0137041 (2015).
Dietrich, B. J., Hayes, M. & O'Brien, D. Z. Pitch perfect: vocal pitch and the emotional intensity of congressional speech. Am. Polit. Sci. Rev. 113, 941–962 (2019).
Dietrich, B. J. Using motion detection to measure social polarization in the U.S. House of Representatives. Polit. Anal. 29, 250–259 (2021).
Michel, J.-B. et al. Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).

In this study, 4% of all books that have been published were digitized and used to examine changes in phonology, word use and the adoption of new technologies over long periods of time.

Merton, R. K. in Social Theory and Social Structure 39–72 (Free Press, 1968).
Watts, D. J. Everything Is Obvious: Once You Know the Answer (Crown Business, 2011).
Simon, H. A. Bandwagon and underdog effects and the possibility of election predictions. Public Opin. Q. 18, 245–253 (1954).
Mutz, D. C. Impersonal Influence in American Politics (Cambridge Univ. Press, 1998).
Westwood, S. J., Messing, S. & Lelkes, Y. Projecting confidence: how the probabilistic horse race confuses and demobilizes the public. J. Polit. 82, 1530–1544 (2020).
O'Neil, C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Crown, 2016).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
Landsberger, H. A. Hawthorne Revisited (The New York State School of Industrial and Labor Relations, 1958).
Mayo, E. The Human Problems of an Industrial Civilization (Routledge, 2004).
Lazer, D., Kennedy, R., King, G. & Vespignani, A. The parable of Google Flu: traps in big data analysis. Science 343, 1203–1205 (2014).

This paper shows that the increasing over-prediction of flu prevalence of Google Flu Trends was largely the result of changes to Google's search algorithm, which altered the terms that people used to find flu-related information.

Brunton, F. & Nissenbaum, H. Obfuscation: A User's Guide for Privacy and Protest (MIT Press, 2015).
Davis, D. W. The direction of race of interviewer effects among African-Americans: donning the Black mask. Am. J. Pol. Sci. 41, 309–322 (1997).
American National Election Studies. 1978 Time Series Study https://electionstudies.org/wp-content/uploads/2018/03/anes_timeseries_1978_qnaire_post.pdf (1978).
Salganik, M. J. Bit by Bit: Social Research in the Digital Age (Princeton Univ. Press, 2017).
Patty, J. W. & Penn, E. M. Analyzing big data: social choice and measurement. PS Polit. Sci. Polit. 48, 95–101 (2015).
Kraemer, M. U. G. et al. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science 368, 493–497 (2020).
Jia, J. S. et al. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature 582, 389–394 (2020).
Badr, H. S. et al. Association between mobility patterns and COVID-19 transmission in the USA: a mathematical modelling study. Lancet Infect. Dis. 20, 1247–1254 (2020).
Munger, K. The limited value of non-replicable field experiments in contexts with low temporal validity. Soc. Media Soc. 5, 1–4 (2019).
Deaton, A. & Cartwright, N. Understanding and misunderstanding randomized controlled trials. Soc. Sci. Med. 210, 2–21 (2018).
Vraga, E. K., Bode, L., Smithson, A.-B. & Troller-Renfree, S. Accidentally attentive: comparing visual, close-ended, and open-ended measures of attention on social media. Comput. Human Behav. 99, 235–244 (2019).
Guess, A., Munger, K., Nagler, J. & Tucker, J. How accurate are survey responses on social media and politics? Polit. Commun. 36, 241–258 (2019).
Aleta, A. et al. Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19. Nat. Hum. Behav. 4, 964–971 (2020).
Echeverría, J. et al. LOBO: evaluation of generalization deficiencies in Twitter bot classifiers. In Proc. 34th Annual Computer Security Applications Conference 137–146 (ACM, 2018).
Ferrara, E., Varol, O., Davis, C., Menczer, F. & Flammini, A. The rise of social bots. Commun. ACM 59, 96–104 (2016).
Hughes, A. G. et al. Using administrative records and survey data to construct samples of Tweeters and Tweets. Public Opin. Q. https://doi.org/10.1093/poq/nfab020 (2021).
Napoli, P. M. Audience Evolution: New Technologies and the Transformation of Media Audiences (Columbia Univ. Press, 2011).
Yang, T., Majó-Vázquez, S., Nielsen, R. K. & González-Bailón, S. Exposure to news grows less fragmented with an increase in mobile access. Proc. Natl Acad. Sci. USA 117, 28678–28683 (2020).

This study tracked the news consumption of users across mobile and desktop devices and found that most individuals do not self-sort their news consumption by partisanship but, instead, consume news from a diversity of sources including partisan and nonpartisan ones.

Haythornthwaite, C. Exploring multiplexity: social network structures in a computer-supported distance learning class. Inf. Soc. 17, 211–226 (2001).
Campbell, K. E. & Lee, B. A. Name generators in surveys of personal networks. Soc. Netw. 13, 203–221 (1991).
Wagner, C. Measuring algorithmically infused societies. Nature https://doi.org/10.1038/s41586-021-03666-1 (2021).
Healy, K. The performativity of networks. Eur. J. Sociol. 56, 175–205 (2015).
Rahwan, I. et al. Machine behaviour. Nature 568, 477–486 (2019).
Neuendorf, K. A. The Content Analysis Guidebook (Sage, 2017).
Davidov, D., Tsur, O. & Rappoport, A. Semi-supervised recognition of sarcasm in Twitter and Amazon. In Proc. 14th Conference on Computational Natural Language Learning 107–116 (Association for Computational Linguistics, 2010).
Groves, R. M. Nonresponse rates and nonresponse bias in household surveys. Public Opin. Q. 70, 646–675 (2006).
Hargittai, E. Potential biases in big data: omitted voices on social media. Soc. Sci. Comput. Rev. 38, 10–24 (2020).

Using survey data, this study finds that younger, wealthier and more technically skilled people tend to use social media and that there were substantial gender and education differences in which platforms people used.

Lazer, D. & Radford, J. Data ex machina: introduction to big data. Annu. Rev. Sociol. 43, 19–39 (2017).
Correa, T. & Valenzuela, S. A trend study in the stratification of social media use among urban youth: Chile 2009–2019. J. Quant. Descr. Digit. Media 1, https://doi.org/10.51685/jqd.2021.009 (2021).
Mellon, J. & Prosser, C. Twitter and Facebook are not representative of the general population: political attitudes and demographics of British social media users. Res. Polit. 4, 1–9 (2017).
Beisch, N. & Schäfer, C. Internetnutzung mit großer Dynamik: Medien, Kommunikation, Social Media. AS&S https://www.ard-werbung.de/media-perspektiven/fachzeitschrift/2020/detailseite-2020/internetnutzung-mit-grosser-dynamik-medien-kommunikation-social-media/ (2020).
Hargittai, E. & Litt, E. The Tweet smell of celebrity success: explaining variation in Twitter adoption among a diverse group of young adults. New Media Soc. 13, 824–842 (2011).
Henrich, J., Heine, S. J. & Norenzayan, A. Most people are not WEIRD. Nature 466, 29 (2010).
Wang, W., Rothschild, D., Goel, S. & Gelman, A. Forecasting elections with non-representative polls. Int. J. Forecast. 31, 980–991 (2015).
Grinberg, N., Joseph, K., Friedland, L., Swire-Thompson, B. & Lazer, D. Fake news on Twitter during the 2016 U.S. presidential election. Science 363, 374–378 (2019).
Bakshy, E., Messing, S. & Adamic, L. A. Exposure to ideologically diverse news and opinion on Facebook. Science 348, 1130–1132 (2015).
Meng, X.-L. Statistical paradises and paradoxes in big data (I): law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 12, 685–726 (2018).
Hargittai, E., Füchslin, T. & Schäfer, M. S. How do young adults engage with science and research on social media? Some preliminary findings and an agenda for future research. Soc. Media Soc. 4, 1–10 (2018).
Blumenstock, J. Don't forget people in the use of big data for development. Nature 561, 170–172 (2018).
Battle-Baptiste, W. & Rusert, B. (eds) W. E. B. Du Bois's Data Portraits: Visualizing Black America (Princeton Architectural Press, 2018).
Siegel, A. A. et al. Trumping hate on Twitter? Online hate speech in the 2016 US election campaign and its aftermath. Quart. J. Polit. Sci. 16, 71–104 (2021).
Allen, J., Howland, B., Mobius, M., Rothschild, D. & Watts, D. J. Evaluating the fake news problem at the scale of the information ecosystem. Sci. Adv. 6, eaay3539 (2020).
Foucault Welles, B. On minorities and outliers: the case for making big data small. Big Data Soc. 1, 1–2 (2014).
Newman, M. E. J. Power laws, Pareto distributions and Zipf's law. Contemp. Phys. 46, 323–351 (2005).
González-Bailón, S. Decoding the Social World: Data Science and the Unintended Consequences of Communication (MIT Press, 2017).
Stopczynski, A. et al. Measuring large-scale social networks with high resolution. PLoS ONE 9, e95978 (2014).
Lazer, D. Studying human attention on the Internet. Proc. Natl Acad. Sci. USA 117, 21–22 (2020).
Aral, S. & Eckles, D. Protecting elections from social media manipulation. Science 365, 858–861 (2019).
Puschmann, C. & Burgess, J. The politics of Twitter data. HIIG Discussion Paper Series No. 2013-01 http://www.ssrn.com/abstract=2206225 (2013).
Chen, W. & Quan-Haase, A. Big data ethics and politics: toward new understandings. Soc. Sci. Comput. Rev. 38, 3–9 (2020).
Breuer, J., Bishop, L. & Kinder-Kurlanda, K. The practical and ethical challenges in acquiring and sharing digital trace data: negotiating public–private partnerships. New Media Soc. 22, 2058–2080 (2020).
Zook, M. et al. Ten simple rules for responsible big data research. PLOS Comput. Biol. 13, e1005399 (2017).
Greenberg, A. An absurdly basic bug let anyone grab all of parler's data. Wired (12 January 2021).
Valentino-DeVries, J., Singer, N., Keller, M. H. & Krolik, A. your apps know where you were last night, and they're not keeping it secret. The New York Times https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html (10 December 2021).
Sweeney, L. Simple demographics often identify people uniquely. Privacy Working Paper 3 https://dataprivacylab.org/projects/identifiability/paper1.pdf (Carnegie Mellon University, 2000).

Using census data, this paper shows that 87% of the US population could be uniquely identified by date of birth, postal code and gender; demonstrating the ease with which study respondents can be re-identified from ostensibly anonymous data.

Wood, A. et al. Differential privacy: a primer for a non-technical audience. Vanderbilt J. Entertain. Technol. Law 21, 209–276 (2019).
Dwork, C. & Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 211–407 (2013).
King, G. & Persily, N. A new model for industry–academic partnerships. PS Polit. Sci. Polit. 53, 703–709 (2020).
Bruckman, A., Luther, K. & Fiesler, C. in Digital Research Confidential: The Secrets of Studying Behavior Online (eds Hargittai, E. & Sandvig, C.) 243–258 (MIT Press, 2015).
Marwick, A. E. & boyd, d. Networked privacy: how teenagers negotiate context in social media. New Media Soc. 16, 1051–1067 (2014).
Bieber, F. R., Brenner, C. H. & Lazer, D. Finding criminals through DNA of their relatives. Science 312, 1315–1316 (2006).
Zheleva, E. & Getoor, L. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In Proc. 18th International Conference on World Wide Web 531–540 (2009).
Miller, G. As U.S. election nears, researchers are following the trail of fake news. Science (26 October 2020).
Merton, R. K. The self-fulfilling prophecy. Antioch Rev. 8, 193–210 (1948).

(References can be scrolled up or down to view)

![](https://kkcktdhgzddmdjdjoenc.supabase.co/storage/v1/object/public/media/images/d8cec2de-1dc4-4dd1-a942-c53fb2daed09/dd1c8776bf7c.png

5Y Capital (formerly Morningside Venture Capital) currently manages approximately RMB 32 billion across USD and RMB dual-currency funds. 5Y Capital seeks out, supports, and inspires founders who are going it alone, providing everything from moral support to hands-on operational backing. We believe that if the person everyone else thinks is crazy starts to be believed, the world will become a different place.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG